

Overview of MPEG-1/2/4 Coding

MPEG is an abbreviation for Moving Picture Experts Group. The group has completed three standards so far, namely MPEG-1, MPEG-2, and MPEG-4. MPEG-1 and MPEG-2 are similar in their basic concepts: both are based on motion-compensated, block-based transform coding techniques similar to those employed in H.263. MPEG-4 employs more sophisticated approaches, including the use of software image construct descriptors, and its initial specifications targeted very low bit rates of 64 kb/s or less. However, the success of MPEG-4, together with the impracticality of keeping the bit rate below 64 kb/s, means that it can now also be used at higher bit rates.

MPEG uses the notion of layers. These help with error handling, random search and editing, and synchronization, for example with an audio bitstream. The top layer in the hierarchy is the video sequence layer: a self-contained bitstream such as a movie or a football game. The second layer down is the group of pictures (GOP), composed of one or more Intra (I) frames together with non-Intra (P and/or B) pictures; the best-known GOP pattern is IBBPBBPBBPBB... The third layer is the picture layer, which carries the coded data of one picture of each of these kinds. The next layer down is the slice layer, a contiguous sequence of macroblocks. Each macroblock consists of a 16x16 array of luminance pixels and two associated 8x8 arrays of chrominance pixels. Macroblocks can be further divided into distinct 8x8 blocks for further processing, such as transform coding.

Intra frames (I frames for short) are encoded without any temporal compression techniques: only lossy and lossless spatial coding is applied to the current picture, without any reference to adjacent frames. Inter frames (P or B), on the other hand, are encoded using motion-prediction techniques that remove the temporal redundancy among frames, in addition to the techniques employed for I frames. Hence the bit count of an I frame is very large compared to that of a P or B frame. Within a GOP, each P frame is predicted from the I or P frame immediately preceding it; temporal compression techniques are thus employed between these frames to generate the P frames. The major disadvantage of P frames is error propagation: any error (e.g., transmission data loss) in an I frame propagates to all the pictures in the GOP, since all the P pictures ultimately depend on that I frame.

One advantage of the H.263 and H.263+ codecs is that they encode each P frame from the previous P frame and cyclically select some macroblocks in each P frame to encode entirely without motion prediction (removing only spatial redundancy). This yields a near-constant bit rate and maintains good quality with sufficient error resiliency, bearing in mind that on average an I frame is at least double the size of a P frame. Hence, smoothing quality against bit rate in real-time video is easier and more effective with the H.26x codecs than with the MPEG codecs.

B frames (bidirectionally interpolated prediction frames) are generated using forward/backward interpolated prediction. For a simple GOP of I, B, P, B, P, B frames, the I frame is, as previously mentioned, encoded spatially only, and the P frames are forward predicted from previous I or P frames. Each B frame, however, is coded using a forward prediction from a previous I or P frame as well as a backward prediction from a succeeding I or P frame; a minimal sketch of this interpolated mode is given below.
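The following is a minimal sketch of this bidirectional interpolation, assuming numpy, integer-pel motion vectors, and a motion search that has already been done; the helper names (mc_block, predict_b_block) are hypothetical, and the simple rounded average is illustrative rather than the normative MPEG prediction mode.

    import numpy as np

    MB = 16  # macroblock size: 16x16 luminance samples

    def mc_block(ref, y, x, mv):
        # Motion-compensated 16x16 block taken from a reference frame.
        # mv = (dy, dx) is an integer-pel motion vector; real MPEG codecs
        # also allow half-pel vectors, omitted here for brevity.
        dy, dx = mv
        return ref[y + dy:y + dy + MB, x + dx:x + dx + MB]

    def predict_b_block(prev_ref, next_ref, y, x, fwd_mv, bwd_mv):
        # Interpolated prediction for one macroblock of a B frame: the
        # rounded average of the forward prediction (from the preceding
        # I/P frame) and the backward prediction (from the succeeding
        # I/P frame). Assumes both vectors keep the block in the frame.
        fwd = mc_block(prev_ref, y, x, fwd_mv).astype(np.int32)
        bwd = mc_block(next_ref, y, x, bwd_mv).astype(np.int32)
        return ((fwd + bwd + 1) // 2).astype(np.uint8)

    # The encoder then transform-codes only the residual
    #   current_block - prediction
    # which is normally far cheaper to code than the block itself.

Because the prediction draws on both a past and a future reference, the residual is usually smaller than for a purely forward-predicted block, which is the source of the compression advantage of B frames discussed below.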
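To see what these relative frame sizes imply for the overall bit rate, here is a back-of-the-envelope calculation in the same vein; the coded-frame sizes and the frame rate are assumed purely for illustration (an I frame roughly double a P frame, as noted above, and B frames smaller still, as discussed below).

    # Assumed, illustrative coded-frame sizes in bits (not measured values)
    bits = {"I": 150_000, "P": 60_000, "B": 20_000}

    gop = "IBBPBBPBB"   # one common GOP pattern (9 frames)
    fps = 25            # assumed frame rate

    gop_bits = sum(bits[f] for f in gop)   # 150k + 2*60k + 6*20k = 390,000
    avg_rate = gop_bits * fps / len(gop)   # about 1083 kb/s

    print(gop_bits, round(avg_rate / 1000), "kb/s")

Under these assumptions the single I frame accounts for well over a third of the GOP's bit budget, which illustrates why the periodic large I frames make rate smoothing harder in the MPEG codecs than in H.26x, where every frame after the first is a P frame.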
For the example GOP above, the sequence is processed by the encoder such that the first B frame is predicted from the first I frame and the first P frame, the second B frame from the first and second P frames, and the third B frame from the second P frame and the first I frame of the next GOP. Consequently, to decode any B frame, the future P or I frame used to encode it must itself be encoded and sent to the decoder first. Because a B frame is predicted from two images (a previous and a following frame), the encoder is better able to identify the motion vectors correctly, which increases the compression efficiency. Thus a B frame is typically very small with respect to a P frame, which in turn is small compared to an I frame. It should also be noted that, since B frames are not used to predict future frames, errors arising in a B frame do not propagate further within the sequence; only errors arising in I or P frames propagate through the GOP.

B frames have two disadvantages. First, the frame-reconstruction memory buffers in both the encoder and the decoder must be doubled in size to hold the two reference frames required to encode or decode a B frame. Second, there is necessarily a delay throughout the system, as the frames are delivered out of display order (a minimal reordering sketch is given at the end of this section).

The MPEG-1 encoder demands very high computational power; it normally requires dedicated hardware for real-time encoding, although decoding can be done in software. MPEG-2 requires even more expensive hardware to encode video in real time. Both MPEG-1 and MPEG-2 are well suited to the purposes for which they were developed: MPEG-1 works very well for playback from CD-ROM, and MPEG-2 for high-quality archiving and TV broadcast applications. For existing computer and Internet infrastructures, however, MPEG-1/2-based solutions are too expensive and require too much bandwidth. To overcome this problem (among others), MPEG produced another standard, MPEG-4, which aims to be suitable for video conferencing (low bit rate, real-time operation). MPEG-4 is based on the segmentation of audiovisual scenes into audio/visual objects that can be multiplexed for transmission over heterogeneous networks. The MPEG-4 framework currently being developed focuses on a language called MSDL (MPEG-4 Syntactic Description Language), which allows applications to construct new codecs by composing more primitive components and provides the ability to download these components dynamically over the Internet.
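The reordering mentioned above can be made concrete with a minimal sketch, again in Python; it assumes a closed GOP in which every B frame uses the nearest surrounding reference frames, and the function name coded_order is hypothetical, not part of any MPEG specification.

    def coded_order(display_order):
        # Reorder a GOP from display order into coded/transmission order:
        # each reference (I or P) frame is moved ahead of the B frames
        # that depend on it, since the decoder needs both of a B frame's
        # references before it can decode that B frame.
        out, pending_b = [], []
        for frame in display_order:
            if frame in ("I", "P"):    # a reference frame
                out.append(frame)      # send the reference first...
                out.extend(pending_b)  # ...then the B frames waiting on it
                pending_b = []
            else:                      # a B frame: hold until next reference
                pending_b.append(frame)
        out.extend(pending_b)  # trailing B frames wait for the next GOP's I
        return out

    print(coded_order(list("IBBPBBPBB")))
    # ['I', 'P', 'B', 'B', 'P', 'B', 'B', 'B', 'B']

The decoder must therefore buffer one reference frame ahead and reorder the pictures back into display order, which is precisely the extra memory and delay attributed to B frames above.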