Concepts
A rate control algorithm dynamically adjusts encoder parameters to achieve a target bitrate. It allocates a budget of bits to each group of pictures, individual picture and/or sub-picture in a video sequence. Rate control is not a part of the H.264 standard, but the standards group has issued non-normative guidance to aid in implementation. The purpose of this white paper is to offer 1) a basic understanding of what rate control is and why it is essential and 2) a common framework and terminology so that schemes originating from H.264 and other standards groups can be more easily understood and compared.
Block-based hybrid video encoding schemes such as the MPEG [1,2] and H.26* [3] families are inherently lossy processes. They achieve compression not only by removing truly redundant information from the bitstream, but also by making small quality compromises in ways that are intended to be minimally perceptible. In particular, the quantization parameter QP regulates how much spatial detail is saved. When QP is very small, almost all of that detail is retained. As QP is increased, some of that detail is aggregated so that the bit rate drops, but at the price of some increase in distortion and some loss of quality. Figure 1 suggests that relationship for a particular input picture: if you want to lower the bit rate, you can do so by raising QP, at a cost of increased distortion. Figure 2 suggests that as source complexity varies during a sequence, you move from one such curve to another.
Figure 1. For a particular source frame | Figure 2. But when source complexity varies…
Figure 3 illustrates open loop (or VBR) operation of a video encoder. The user supplies two key inputs – the uncompressed video source and a value for QP. As the source sequence progresses, you will get compressed video of fairly constant quality, but the bitrate may vary dramatically. Because the complexity of pictures is continually changing in a real video sequence, it is not so obvious what value of QP to pick. If you fix QP for an "easy" part of the sequence having slow motion and uniform areas, then the bit rate will go up dramatically when you reach the "hard" (i.e., more complex) parts.
In reality, constraints imposed by decoder buffer size and network bandwidth force us to encode video at a more nearly constant bitrate. To do this, Figure 4 suggests that we must dynamically vary QP based upon estimates of the source complexity, so that each picture (or group of pictures) gets an appropriate allocation of bits to work with. Rather than specifying QP as input, the user specifies demanded bitrate instead.
Figure 3. Open Loop Encoding (VBR) | Figure 4. Closed Loop Rate Control (CBR)
Elements of H.264 Rate Control
With a focus on the recommended approach [4, 5, 6] for H.264, Figure 5 identifies important elements within the rate controller. Most of these elements are common to other rate control schemes. Note that Figure 5 is conceptual and is not a literal representation of any software implementation. Many details are glossed over – for example, that B and P pictures are treated differently, and that some estimates are averages of sampled data over multiple pictures.
Figure 5. Elements of H.264 Rate Controller
Rate-Quantization Model
The heart of the algorithm is a quantitative model describing Figure 2: the relationship between QP, actual bitrate and a surrogate for encoding complexity. However, the bits and complexity terms should be associated only with the residuals. Why? Because the quantization parameter QP can only influence the detail of information carried in the transformed residuals. QP has no direct effect on the bitrates associated with overhead, prediction data, or motion vectors. The Mean Absolute Difference (or MAD) of the prediction error is used as the complexity surrogate.
The model takes an algebraic form such as
$$\text{ResidualBits} = C_1 \cdot \frac{\text{MAD}}{QP} + C_2 \cdot \frac{\text{MAD}}{QP^2} \qquad (2)$$
but it may take a simpler form (with C2 = 0) or a more complicated form involving exponentials or other basis curves for fitting. This equation corresponds to equation 2-84 of [6] and to equation 1 of [2]; note that our term ResidualBits is synonymous with the term Texture Bits used by other authors [2]. The free coefficients C1 and C2 may be estimated empirically, by providing hooks in the encoder for extracting the residual coefficients, as well as the number of residual bits needed to transmit them.
Having established the model in (2), we can solve for the demanded QP when the target value of ResidualBits is supplied by the Bit Allocation modules in Figure 5.
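As an illustration of that inversion, the following minimal sketch solves a quadratic rate-quantization model of the form ResidualBits = C1 * MAD / QP + C2 * MAD / QP^2 for QP. The function name `solve_qp` and its arguments are hypothetical and are not taken from any reference encoder; the result is a real number that an encoder would still need to round and clamp to its valid QP range.

```python
import math


def solve_qp(target_residual_bits, mad, c1, c2):
    """Invert ResidualBits = c1*MAD/QP + c2*MAD/QP**2 for QP.

    Hypothetical helper for illustration only. Multiplying the model by
    QP**2 gives T*QP**2 - (c1*MAD)*QP - (c2*MAD) = 0, where T is the
    target number of residual bits; the larger root is returned.
    """
    if target_residual_bits <= 0 or mad <= 0:
        raise ValueError("target bits and MAD must be positive")
    a = c1 * mad
    b = c2 * mad
    discriminant = a * a + 4.0 * target_residual_bits * b
    if c2 == 0 or discriminant < 0:
        # Assumption: fall back to the simpler linear model (C2 = 0).
        return a / target_residual_bits
    return (a + math.sqrt(discriminant)) / (2.0 * target_residual_bits)
```

For example, with MAD = 10, C1 = 1000 and C2 = 20000, a QP of 20 yields 500 + 500 = 1000 residual bits, so `solve_qp(1000, 10, 1000, 20000)` recovers a QP of 20.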
Complexity Estimation
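As indicated above, we need a simple metric that reflects the encoding complexity associated with the residuals. The MAD of the prediction error is a convenient surrogate for this purpose:
$$\text{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - p_i|$$
where $x_i$ is the source value of the $i$th pixel, $p_i$ is its predicted value, and $N$ is the number of pixels considered.
This MAD is an inverse measure of the predictor's accuracy and (in the case of inter prediction) the temporal similarity of adjacent pictures.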
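The MAD of the prediction error can be computed directly from the source and predicted sample values. The sketch below assumes both are available as flat sequences of numbers; the function name `mean_absolute_difference` is hypothetical.

```python
def mean_absolute_difference(source, prediction):
    """Mean absolute difference between source and predicted samples.

    Both arguments are equal-length, non-empty sequences of pixel values
    (an assumption made for this illustration).
    """
    if len(source) != len(prediction) or len(source) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(s - p) for s, p in zip(source, prediction))
    return total / len(source)
```

A perfect prediction gives a MAD of zero; the larger the MAD, the more residual information remains to be coded.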
Ideally, the MAD would be estimated after encoding the current picture, but that would require us to encode the picture again after the QP is selected – quite a burden for a computationally intensive standard like H.264! Instead, we can usually assume that this complexity surrogate varies gradually from picture to picture, and estimate it based upon data extracted from the encoder for previous pictures. Note that this assumption fails at a scene change.
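One simple way to estimate the MAD from previous pictures is a linear predictor whose coefficients are refitted from recent history. The sketch below is an illustrative assumption, not the procedure specified in [5] or [6]: the class name `MadPredictor`, the window length, and the least-squares refit are all hypothetical choices. With its initial coefficients it simply reuses the previous picture's MAD.

```python
from collections import deque


class MadPredictor:
    """Predict the next picture's MAD as a1 * previous_mad + a2.

    Hypothetical sketch: the coefficients start at a1 = 1, a2 = 0 and are
    refitted by least squares over a sliding window of
    (previous MAD, actual MAD) pairs.
    """

    def __init__(self, window=20):
        self.a1 = 1.0
        self.a2 = 0.0
        self.pairs = deque(maxlen=window)
        self.last_mad = None

    def predict(self):
        """Return the predicted MAD, or None before any picture is coded."""
        if self.last_mad is None:
            return None
        return self.a1 * self.last_mad + self.a2

    def update(self, actual_mad):
        """Record the MAD measured after encoding a picture."""
        if self.last_mad is not None:
            self.pairs.append((self.last_mad, actual_mad))
            self._refit()
        self.last_mad = actual_mad

    def _refit(self):
        n = len(self.pairs)
        if n < 2:
            return
        mean_x = sum(x for x, _ in self.pairs) / n
        mean_y = sum(y for _, y in self.pairs) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.pairs)
        if var_x == 0:
            return  # degenerate history; keep the current coefficients
        cov = sum((x - mean_x) * (y - mean_y) for x, y in self.pairs)
        self.a1 = cov / var_x
        self.a2 = mean_y - self.a1 * mean_x
```

A predictor of this kind tracks gradual changes only; at a scene change the history is not representative and the estimate will be poor.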
QP-Limiter
Figures 4 and 5 represent a closed loop control system which must be appropriately damped to guarantee stability and to minimize perceptible variations in quality. For difficult sequences having rapid changes in complexity, QP-demand may oscillate noticeably, so a rate limiter is applied which typically limits changes in QP to no more than ± 2 units between pictures.
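A limiter of this kind amounts to a clamp around the previous picture's QP. In the minimal sketch below, the function name `limit_qp` and its parameters are hypothetical; the default step of 2 follows the typical value quoted above, and 0 to 51 is the QP range defined by H.264.

```python
def limit_qp(demanded_qp, previous_qp, max_step=2, qp_min=0, qp_max=51):
    """Clamp the demanded QP to within max_step of the previous QP.

    Hypothetical helper: the result is also kept inside [qp_min, qp_max].
    """
    lower = max(qp_min, previous_qp - max_step)
    upper = min(qp_max, previous_qp + max_step)
    return max(lower, min(upper, int(round(demanded_qp))))
```

For example, if the previous picture used QP 28 and the model demands 35, the limiter returns 30.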
Virtual Buffer Model
Any compliant decoder is equipped with a buffer to smooth out variations in the rate and arrival time of incoming data. The corresponding encoder must produce a bitstream that satisfies constraints of the decoder, so a virtual buffer model is used to simulate the fullness of the real decoder buffer.
The change in fullness of the virtual buffer is the total bits encoded into the stream, less the bits removed at a constant rate that is assumed to equal the bandwidth (or demanded bitrate). The buffer fullness is bounded by zero from below and by the buffer capacity from above. The user must specify appropriate values for buffer capacity and initial buffer fullness, consistent with the decoder levels supported.
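The buffer bookkeeping described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class name `VirtualBuffer` is hypothetical, the buffer is updated once per picture, and the constant removal rate is expressed as demanded bitrate divided by frame rate.

```python
class VirtualBuffer:
    """Per-picture virtual buffer model (illustrative sketch only)."""

    def __init__(self, capacity_bits, initial_fullness_bits, bitrate, frame_rate):
        self.capacity = float(capacity_bits)
        self.fullness = float(initial_fullness_bits)
        # Bits drained during one picture interval at the demanded bitrate.
        self.drain_per_picture = bitrate / frame_rate

    def add_picture(self, encoded_bits):
        """Add the bits of one encoded picture and drain one interval."""
        unclamped = self.fullness + encoded_bits - self.drain_per_picture
        # Fullness is bounded by zero from below and capacity from above.
        self.fullness = min(max(unclamped, 0.0), self.capacity)
        return self.fullness
```

At 3 Mbit/s and 30 frames per second, 100,000 bits drain per picture, so a 150,000-bit picture raises the fullness by 50,000 bits.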
QP Initializer
QP must be initialized at the start of a video sequence. An initial value may be input manually, but a better approach is to estimate it from the demanded bits per pixel, i.e.,
DemandedBitsPerPixel = DemandedBitrate / (FrameRate * height * width)
Equation 2-67 of [6] provides a recommended table relating initial QP to DemandedBitsPerPixel.
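The calculation can be sketched as below. The function names are hypothetical, and the threshold table `PLACEHOLDER_TABLE` contains made-up illustrative values only; it is not the table recommended in [6], which should be consulted for real use.

```python
def demanded_bits_per_pixel(demanded_bitrate, frame_rate, width, height):
    """DemandedBitsPerPixel = DemandedBitrate / (FrameRate * height * width)."""
    return demanded_bitrate / (frame_rate * height * width)


# ASSUMPTION: placeholder (upper bits-per-pixel threshold, initial QP) pairs,
# chosen only to make the example runnable. These are NOT the values of [6].
PLACEHOLDER_TABLE = [(0.1, 40), (0.3, 30), (0.6, 20), (float("inf"), 10)]


def initial_qp(bits_per_pixel, table=PLACEHOLDER_TABLE):
    """Return the QP of the first table row whose threshold is not exceeded."""
    for threshold, qp in table:
        if bits_per_pixel <= threshold:
            return qp
    return table[-1][1]
```

For a 352 x 288 sequence at 30 frames per second and 1 Mbit/s, the demanded bits per pixel is about 0.33. Fewer bits per pixel call for a larger initial QP.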
GOP Bit Allocation
Based upon the demanded bit rate and the current fullness of the virtual buffer, a target bit rate for the entire group of pictures (GOP) is determined, and QP for the GOP's I-picture and first P-picture is also determined.
The GOP Target is fed into the next block for detailed bit allocation to pictures or to smaller basic units.
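As a rough sketch of the GOP-level budget, the target can be taken as the nominal number of bits for the GOP at the demanded bitrate, reduced by whatever is currently held in the virtual buffer. This is a simplified assumption for illustration; the exact expressions, including the choice of QP for the first pictures, are given in [5] and [6]. The function name `gop_target_bits` is hypothetical.

```python
def gop_target_bits(demanded_bitrate, frame_rate, pictures_in_gop,
                    buffer_fullness_bits):
    """Simplified GOP budget: nominal GOP bits minus current buffer fullness.

    Illustrative assumption only; not the exact formula of [5] or [6].
    """
    nominal = demanded_bitrate / frame_rate * pictures_in_gop
    return max(nominal - buffer_fullness_bits, 0.0)
```

With 15 pictures per GOP at 3 Mbit/s and 30 frames per second, the nominal budget is 1,500,000 bits; a buffer already holding 100,000 bits reduces the target to 1,400,000 bits.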
Basic Unit Bit Allocation
The "Basic Unit" is useful terminology introduced in [4], which is the basis for H.264 rate control recommendations [6]. With this approach, scalable rate control may be pursued to different levels of granularity – such as picture, slice, macroblock row or any contiguous set of macroblocks. That level is referred to as a "basic unit" at which rate control is resolved, and for which distinct values of QP are calculated.
If the basic unit is smaller than a picture, then this block in Figure 5 actually breaks out into two layers – one for the picture itself and another for the basic unit. Figure 5 and our discussion are limited to the case where the picture itself is the basic unit. For details on how to treat smaller basic units, please see [5] or [6].
For H.264, the emphasis is on computing QP for each stored picture (usually a P-picture). Strictly speaking, the H.264 standard allows B pictures to be used as reference pictures, but this is not expected to be common usage. The QPs for non-stored pictures (ordinarily B-pictures) are then interpolated (and offset) from the QP values of their neighboring P pictures. First, considering the MAD of the picture, one can determine a target level for the buffer fullness. Then, using the buffer target level, it is easy to calculate the target bits for the picture.
Comparison with MPEG-2 (Test Model 5) Rate Control
Because of the influence and familiarity of MPEG's Test Model 5 rate control [7], it is useful to compare its similarities and differences with the H.264 approach. To do so, we transmogrify Figure 5 into Figure 6, which corresponds conceptually to the MPEG2/TM5 approach.
Figure 6. Comparison to MPEG2 Test Model 5
Similarities include the use of the virtual buffer model, the calculation of layered bit targets for the GOP and picture, and the overall goal of generating a quantization parameter (in this case, called Mquant) for a basic unit. The Mquant for the basic unit (always a single macroblock) is adjusted in proportion to its estimated complexity.
Differences include:
• The Basic Unit is always the macroblock in this scheme. It is possible to get significant variations of quantization parameter across different macroblocks in the same picture.
• Differences between I, P and B picture types arise in the allocation of target bits. Otherwise, they are treated similarly.
• MPEG-2 does not have the same multiplicity of prediction modes. In the absence of advanced intra prediction, it need not be so rigorous in relating quantization parameter (which controls residual quality) to measured properties of the residual itself.
• Macroblock-level spatial complexity is estimated from the source activity, regardless of whether the complexity is handled by transmitting motion vectors (inter-prediction) or residual coefficients.
• Allocation of bits to a picture considers the picture type, GOP structure and demanded bitrate, but not the picture's measured complexity. However, within the picture, the buffer fullness and relative spatial activity of each macroblock are used to allocate the picture bits among the macroblocks.
It is easy to recognize this Test Model 5 approach as an ancestor of the H.264 approach, which accommodates the more general prediction methods of H.264 and provides more flexibility to scale the granularity of control.
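The macroblock-level adjustment in Test Model 5 can be sketched as follows. The formulas follow the TM5 description [7] as commonly stated: a reference quantizer derived from virtual buffer fullness, scaled by a normalized activity between 0.5 and 2. The function name `tm5_mquant` and its argument names are hypothetical, and the activity values are assumed to be computed elsewhere.

```python
def tm5_mquant(buffer_fullness, bitrate, picture_rate, activity, avg_activity):
    """Sketch of TM5 adaptive quantization for one macroblock.

    buffer_fullness: virtual buffer fullness before coding this macroblock.
    activity:        spatial activity of this macroblock.
    avg_activity:    average activity of the previous picture (must be > 0).
    """
    reaction = 2.0 * bitrate / picture_rate          # reaction parameter r
    q_ref = buffer_fullness * 31.0 / reaction        # reference quantizer
    # Normalized activity ranges from 0.5 (flat) to 2.0 (very busy).
    n_act = (2.0 * activity + avg_activity) / (activity + 2.0 * avg_activity)
    return min(max(int(round(q_ref * n_act)), 1), 31)  # clip to [1, 31]
```

A macroblock with average activity receives the reference quantizer unchanged, a flat macroblock receives a finer one, and a busy macroblock a coarser one.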
H.264 Rate-Distortion Optimization and Global Rate Control
H.264 provides 7 modes for inter (temporal) prediction, 9 modes for intra (spatial) prediction of 4x4 blocks, 4 modes for intra prediction of 16 x 16 macroblocks, and one skip mode. Each 16 x 16 macroblock can be broken down in numerous ways. Thus, mode selection for each macroblock is a critical and time-consuming step that enables much of the dramatic bitrate reduction.
Selection of the optimal mode is done by an algorithm called rate-distortion optimization (RDO) [8], which essentially involves 1) an exhaustive pre-calculation of all feasible modes to determine the bits and distortion of each; 2) evaluation of a metric that considers both bitrate and distortion; and 3) selection of the mode that minimizes the metric.
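The three steps can be sketched with a Lagrangian cost J = D + lambda * R, which is the form of metric commonly used for this purpose (see [8]). In this minimal sketch the distortion and bit counts of each candidate mode are assumed to have been measured already; the function name `choose_mode`, the mode labels and the numbers in the usage note are hypothetical.

```python
def choose_mode(candidates, lagrange_multiplier):
    """Return the name of the mode with the lowest rate-distortion cost.

    candidates: iterable of (mode_name, distortion, bits) tuples obtained by
    trial-encoding the macroblock in each feasible mode.
    The cost is J = distortion + lagrange_multiplier * bits.
    """
    def cost(candidate):
        _, distortion, bits = candidate
        return distortion + lagrange_multiplier * bits

    return min(candidates, key=cost)[0]
```

For candidates ("skip", 900, 1), ("inter16x16", 400, 40) and ("intra4x4", 150, 220), a multiplier of 5 selects inter16x16, while a small multiplier of 0.5 favors the low-distortion intra4x4 mode and a large multiplier of 20 favors skip.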
QP is input to the RDO process, which does not regulate QP or modify the quality of the residual coefficients. RDO is complementary to rate control; these two aspects of the problem are decoupled because a fully coupled optimization would require a more expensive iterative solution.
The interplay with RDO, described in [4] as a "chicken and egg" dilemma, influences implementation of a rate control algorithm. The MAD is needed by the rate control algorithm, but it is available only after the RDO has used a QP value to generate it. Thus, the rate control algorithm must use an estimate for MAD based upon complexity of prior pictures in the sequence.
ExpertH264 Implementation of Rate Control
PixelTools has implemented the H.264 rate control recommendations in a recent release of ExpertH264. For this release, we have provided picture level control without frame skip. Especially for offline applications for encoding to stored media, this algorithm provides excellent tracking of bitrates for GOPs of a wide variety of sizes.
Typical results track GOP bitrate within 1% without B pictures or 2-3% with B pictures, with good stabilization of QP to prevent noticeable swings in quality. You can try this for yourself by requesting a free demo of ExpertH264 from PixelTools Corporation.
In subsequent releases, we plan to allow flexibility for smaller basic units, which will allow closer bitrate tracking on the individual picture level, as well as for smaller virtual buffer capacities. We will also support both frame skip and stuffing bits in a subsequent release – depending upon the end requirements, use of one or both of these techniques will reduce variations in bitrate.
The algorithm is a separate module having several interfaces that can be called by the encoder, and with callbacks to the encoder for retrieving key information such as residual bits and residual coefficients. Construction of the complexity metric (i.e., prediction error MAD) is part of the rate control algorithm. C Interfaces and utility functions include:
• init_rateControl
• initRateControlParams
• gopRateControl
• frameRateControl
• getQB
• updateModel
• updateBFrameState
• getMbMAD
• initialQP
Thus, developers of hardware and software encoders can consider integrating this algorithm into their own environments. For example, after the encoding step, a call to updateModel refreshes the empirical coefficients such as C1 and C2 in equation (2). Similarly, frameRateControl is called prior to encoding each picture and supplies the quantization parameter.
Terminology
The following glossary is intended to help with a common understanding of rate control issues.
Prediction. Both H.264 and MPEG-* may predict a macroblock by traditional inter (temporal) prediction, i.e., a motion estimation from previous reference pictures followed by transmission of the motion vector. Additionally, H.264 supports advanced intra (spatial) prediction of a macroblock from encoded values for neighboring pixels that have already been encoded (e.g., in raster-scan order).
Residual. The difference between the source and prediction signals is called the residual, or the prediction error. A spatial transform is then applied to the residual to produce transformed coefficients that carry any spatial detail that is not captured in the prediction itself or its reference pictures.
Distortion. Distortion refers to the difference between the original source image x, and the reconstructed image y after it has been decoded. In H.264, the sum of squared differences is used to quantify distortion as $\frac{1}{N} \sum_{i} |y_i - x_i|^2$, for any set of N pixels.
Complexity. As the saying goes, I can't define complexity, but I know it when I see it! A single source picture is complex if it is "busy" and has lots of spatial detail. The term spatial activity is synonymous with source complexity for this case. However, for a video sequence, the meaning of complexity is, well, more complex! For example, if a video sequence consists of one busy object that translates slowly across the field of view, it may not require very many bits because the temporal prediction can easily capture the motion using a single reference picture and a series of motion vectors. It is difficult to define an inclusive video complexity metric that is also easy to calculate. See MAD.
MAD: Mean Absolute Difference of Prediction Error. For rate control, what is more important is the encoding complexity of the residuals that are left over after the inter or intra prediction process is finished. The Mean Absolute Difference of Prediction Error is usually closely related to encoding complexity. Suppose $x_i$ is the source value for the $i$th pixel and $p_i$ is its predicted value; then, for a set of N pixels, $\text{MAD} = \frac{1}{N} \sum_{i} |x_i - p_i|$.
Spatial Activity. This term is used to quantify the amount of spatial variation within a part of the picture, normally a block of N pixels. Suppose the N pixel values are $x_i$, $i = 1, \ldots, N$. Then the activity for those N pixels is $\frac{1}{N} \sum_{i} (x_i - \langle x \rangle)^2$, where $\langle x \rangle = \frac{1}{N} \sum_{i} x_i$. In other words, the spatial activity is the sample variance of a block's values. It is the measure for local complexity used in MPEG-2.
Bitrate. Bitrate refers to the bits per second consumed by a sequence of pictures, i.e., bitrate = (average bits per picture) * (frames per second). In practice, it is equated to the reliable network bandwidth that is provisioned or available for the stream.
Quantization Parameter (QP). Residuals are transformed into the spatial frequency domain by an integer transform that approximates the familiar Discrete Cosine Transform (DCT). The Quantization Parameter determines the step size for associating the transformed coefficients with a finite set of steps. Large values of QP represent big steps that crudely approximate the spatial transform, so that most of the signal can be captured by only a few coefficients. Small values of QP more accurately approximate the block's spatial frequency spectrum, but at the cost of more bits. In H.264, each unit increase of QP lengthens the step size by 12% and reduces the bitrate by roughly 12%.
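The relationship between QP and step size can be illustrated numerically. In H.264 the quantizer step size doubles for every increase of 6 in QP, which is where the figure of roughly 12% per unit comes from (2^(1/6) is about 1.12). The sketch below uses the commonly quoted base step sizes for QP 0 to 5; the function name `qstep` is hypothetical.

```python
# Commonly quoted H.264 quantizer step sizes for QP = 0..5; larger QPs
# repeat this pattern, doubling every 6 units.
_BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Return the quantizer step size for an H.264 QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("H.264 QP must be in the range 0..51")
    return _BASE_QSTEP[qp % 6] * 2 ** (qp // 6)
```

For example, `qstep(4)` is 1.0, `qstep(10)` is 2.0 and `qstep(51)` is 224.0.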
Group of Pictures (GOP). The Group of Pictures concept is inherited from MPEG and refers to an I-picture, followed by all the P and B pictures until the next I picture. A typical MPEG GOP structure might be IBBPBBPBBI. Although H.264 does not strictly require more than one I picture per video sequence, the recommended rate control approach does require a repeating GOP structure to be effective. Thus, H.264 rate control will not work properly if the IntraPeriod parameter is set to 0.
Basic unit. The authors of references [4] and [5] introduced this useful term that expresses the granularity on which QP is adjusted in the feedback control loop. If the basic unit is a picture, then the rate controller's adjustments to QP are uniform across the picture. In MPEG-2, the basic unit is a macroblock. Initially, most H.264 applications will probably use the picture as basic unit, but ultimately a full or partial row of macroblocks is expected to yield the best compromise between uniform bitrate and uniform quality.
Summary
This white paper presents the basics of rate control for H.264 and compares them to the Test Model 5 approach of MPEG-2. Implementers needing a detailed description of the algorithm should see [5] or [6]. The structure shown in our Figure 5, the discussion of its modules, and the terminology glossary should provide a useful companion to help in understanding the densely packed equations found in these references.
References
1. C. Poynton, Digital Video and HDTV, Elsevier Science 2003, pp. 491-2
2. A. Vetro, "MPEG-4 Rate Control for Multiple Video Objects," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 1, February 1999
3. G. Sullivan, T. Wiegand and K.P. Lim, "Joint Model Reference Encoding Methods and Decoding Concealment Methods; Section 2.6: Rate Control" JVT-I049, San Diego, September 2003
4. Z. Li et al., "Adaptive Basic Unit Layer Rate Control for JVT," JVT-G012, 7th Meeting: Pattaya, Thailand, March 2003
5. Z. Li et al., "Proposed Draft of Adaptive Rate Control," JVT-H017, 8th Meeting: Geneva, May 2003
6. G. Sullivan, T. Wiegand and K.P. Lim, "Joint Model Reference Encoding Methods and Decoding Concealment Methods; Section 2.6: Rate Control" JVT-I049, San Diego, September 2003
7. MPEG 2 Test Model 5, Rev. 2, Section 10: Rate Control and Quantization Optimization, ISO/IEC/JTC1SC29WG11, April 1993
8. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini and G. Sullivan, "Rate-Constrained Coder Control and Comparison of Video Coding Standards," IEEE Transactions on Circuits & Systems for Video Technology, 13, #7, July 2003