Audio frames, video frames, and their synchronization

The previous article introduced basic knowledge of audio and video development. This article introduces the main parameters of audio frames and video frames, and how audio and video are synchronized. The main content is as follows:

  1. Audio Frames
  2. Video Frames
  3. DTS and PTS
  4. Audio and Video Synchronization

Audio Frames#

The concept of an audio frame is not as clear-cut as that of a video frame: for almost all video encoding formats, a frame can simply be regarded as one coded image, whereas what constitutes an audio frame depends on the encoding format. A PCM audio stream, for example, can be played directly without any framing. The following introduces audio frames using the MPEG audio frame format as an example.

Frame Size#

Frame size refers to the number of samples per frame; it is a constant determined by the MPEG version and layer, as follows:

            MPEG 1   MPEG 2   MPEG 2.5
Layer Ⅰ     384      384      384
Layer Ⅱ     1152     1152     1152
Layer Ⅲ     1152     576      576

Frame Length#

Frame length refers to the length of each compressed frame, including the frame header and any padding. Because of padding and bit-rate changes, the frame length is not constant. The padding flag is the 9th bit of the frame header: 0 means the frame has no padding slot, 1 means it has one. Padding is explained as follows:

Padding is used to fit the bit rates exactly. For an example: 128k 44.1kHz layer II uses a lot of 418 bytes and some of 417 bytes long frames to get the exact 128k bitrate. For Layer I slot is 32 bits long, for Layer II and Layer III slot is 8 bits long.

It follows that the padding slot for Layer Ⅰ is 4 bytes (one 32-bit slot), while for Layer Ⅱ and Layer Ⅲ it is 1 byte (one 8-bit slot). When reading an MPEG file, the frame length must be calculated, padding included, in order to locate adjacent frames. The formula for the frame length is as follows:

// Layer I (SampleSize = 384), unit: bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding * 4
FrameLengthInBytes = 48 * BitRate / SampleRate + Padding * 4
// Layer II & III (SampleSize = 1152), unit: bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding
FrameLengthInBytes = 144 * BitRate / SampleRate + Padding

Here SampleSize is the number of samples per frame (the fixed value from the frame size section above), Padding is the padding bit, BitRate is the bit rate, and SampleRate is the sample rate. The bit rate and sample rate can both be read from the frame header.

image

If an MP3 audio file has a bit rate of 320 kbps, a sample rate of 44.1 kHz, and no padding, its frame length is approximately 144 × 320000 / 44100 ≈ 1044 bytes.
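As a rough illustration of the formulas above, here is a minimal Java sketch (the class and method names are made up for this article) that computes the frame length for Layer Ⅰ and for Layer Ⅱ/Ⅲ:

public class MpegFrameLength {

    // Frame length in bytes for MPEG-1 Layer I (384 samples per frame).
    static int layer1FrameLength(int bitRate, int sampleRate, int padding) {
        // 384 / 8 = 48; the padding slot is 4 bytes for Layer I.
        return 48 * bitRate / sampleRate + padding * 4;
    }

    // Frame length in bytes for MPEG-1 Layer II / III (1152 samples per frame).
    static int layer23FrameLength(int bitRate, int sampleRate, int padding) {
        // 1152 / 8 = 144; the padding slot is 1 byte for Layer II / III.
        return 144 * bitRate / sampleRate + padding;
    }

    public static void main(String[] args) {
        // 320 kbps, 44.1 kHz, no padding -> about 1044 bytes, matching the example above.
        System.out.println(layer23FrameLength(320_000, 44_100, 0));
    }
}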

Bit Rate#

The bit rate can be obtained from bits 12 to 15 of the MPEG audio frame header, in kbps. The reference values are as follows:

bits   V1,L1   V1,L2   V1,L3   V2,L1   V2,L2 & L3
0000   free    free    free    free    free
0001   32      32      32      32      8
0010   64      48      40      48      16
0011   96      56      48      56      24
0100   128     64      56      64      32
0101   160     80      64      80      40
0110   192     96      80      96      48
0111   224     112     96      112     56
1000   256     128     112     128     64
1001   288     160     128     144     80
1010   320     192     160     160     96
1011   352     224     192     176     112
1100   384     256     224     192     128
1101   416     320     256     224     144
1110   448     384     320     256     160
1111   bad     bad     bad     bad     bad

Regarding the explanations in the table:

  • V1: MPEG Version 1
  • V2: MPEG Version 2 and Version 2.5
  • L1: Layer Ⅰ
  • L2: Layer Ⅱ
  • L3: Layer Ⅲ

MPEG files may use a variable bit rate (VBR), meaning the bit rate can change from frame to frame; it is enough to know how to read the bit rate of the current frame from its header.

Sample Rate#

The sample rate can be obtained from bits 10 to 11 of the MPEG audio frame header, in Hz. The reference values are as follows:

bits   MPEG1     MPEG2     MPEG2.5
00     44100     22050     11025
01     48000     24000     12000
10     32000     16000     8000
11     reserv.   reserv.   reserv.
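To make the bit positions concrete, here is a minimal Java sketch of reading the bit-rate index, sample-rate index, and padding bit from a 32-bit frame header. The example header value is hand-constructed (MPEG-1 Layer Ⅲ, 320 kbps, 44.1 kHz, no padding), and only the V1,L3 bit-rate column and the MPEG-1 sample rates are included:

public class MpegHeader {

    // Bit rates in kbps for MPEG-1 Layer III (the V1,L3 column of the table above);
    // index 0 is "free", index 15 is "bad".
    static final int[] V1_L3_BITRATES = {
            0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, -1};

    // Sample rates in Hz for MPEG-1; index 3 is reserved.
    static final int[] MPEG1_SAMPLE_RATES = {44100, 48000, 32000, -1};

    public static void main(String[] args) {
        // Hand-constructed example header: MPEG-1 Layer III, 320 kbps, 44.1 kHz, no padding.
        int header = 0xFFFBE000;

        int bitrateIndex = (header >> 12) & 0xF;    // bits 12-15
        int sampleRateIndex = (header >> 10) & 0x3; // bits 10-11
        int padding = (header >> 9) & 0x1;          // bit 9

        System.out.println("bit rate:    " + V1_L3_BITRATES[bitrateIndex] + " kbps");
        System.out.println("sample rate: " + MPEG1_SAMPLE_RATES[sampleRateIndex] + " Hz");
        System.out.println("padding:     " + padding);
    }
}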

Duration of Each Frame#

The duration of each frame can be calculated using the following formula:

// Unit: ms
FrameTime = SampleSize / SampleRate * 1000

Where SampleSize represents the number of samples, which is the frame size, and SampleRate represents the sample rate.

image

For example, for an MP3 audio file with a sample rate of 44.1 kHz, the duration of each frame is 1152 / 44100 × 1000 ≈ 26 ms. This is why the playback duration of each MP3 frame is commonly quoted as a fixed 26 ms.
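The same calculation as a small Java sketch (the class and method names are hypothetical):

public class FrameDuration {

    // Frame duration in milliseconds: samples per frame divided by the sample rate.
    static double frameDurationMs(int samplesPerFrame, int sampleRate) {
        return samplesPerFrame / (double) sampleRate * 1000;
    }

    public static void main(String[] args) {
        // 1152 samples at 44.1 kHz -> about 26.12 ms, the familiar "26 ms per MP3 frame".
        System.out.println(frameDurationMs(1152, 44_100));
    }
}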

Video Frames#

In video compression, different frames are compressed with different algorithms to reduce the amount of data. Usually only the differences between images are encoded, so information that stays the same does not have to be transmitted again. These per-frame algorithms are generally referred to as picture types or frame types. The three main picture types are I, P, and B, with the following characteristics:

  • I-frame: Intra-coded frame, usually the first frame of each GOP (explained below). It has the lowest compressibility and can be decoded without other video frames. It can be considered as a complete image. Typically, I-frames are used for random access and serve as references for decoding other frames.
  • P-frame: Predictive-coded frame, representing the differences between the current frame and the previous frame (I or P-frame). It needs to refer to the previous I-frame or P-frame to generate a complete image. Compared to I-frames, P-frames have higher compressibility and save space, so they are also called delta frames.
  • B-frame: Bi-directional predictive-coded frame, representing the differences between the current frame and both the previous and subsequent frames. It needs to refer to a preceding I-frame or P-frame and a following I-frame or P-frame to reconstruct a complete image. B-frames have the highest compressibility.

The frames or pictures mentioned above are usually divided into macroblocks, the basic unit of motion prediction; in MPEG-2 and earlier codecs a macroblock typically covers 16×16 pixels. The prediction type is chosen per macroblock rather than being the same for the entire image. Specifically:

  • I-frame: Contains only intra macroblocks.
  • P-frame: Can contain intra macroblocks or predictive macroblocks.
  • B-frame: Can contain intra, predictive, and bi-directional predictive macroblocks.

The following is a diagram of I-frames, P-frames, and B-frames:

image

In the H.264 / MPEG-4 AVC standard, the granularity of prediction types is reduced to the slice level. A slice is a spatially distinct region of a frame that is encoded separately from the other regions of the same frame, and I slices, P slices, and B slices take the place of I, P, and B frames. A rough understanding of slices is sufficient for the purposes of this article.

As mentioned earlier, GOP stands for Group of Pictures. Each GOP starts with an I-frame, followed by P-frames and B-frames. The following figure shows an example:

image

The display order shown in the figure is:

I1, B2, B3, B4, P5, B6, B7, B8, P9, B10, B11, B12, I13

The decoding order is:

I1, P5, B2, B3, B4, P9, B6, B7, B8, I13, B10, B11, B12

The subscript numbers represent the PTS in the original frame data, which can be understood as the position in the GOP.

DTS and PTS#

  • DTS (Decoding Time Stamp): Represents the decoding time of the compressed frame, which tells the player when to decode the data of this frame.
  • PTS (Presentation Time Stamp): Represents the display time of the original frame obtained after decoding the compressed frame, which tells the player when to display the data of this frame.

For audio, DTS and PTS are the same. For video, B-frames are predicted from both directions, so DTS and PTS can differ: if a GOP contains no B-frames, DTS and PTS are identical; if it does, they are not. For example, for the sequence I B B P B P:

Display order:            I1   B2   B3   P4   B5   P6
Decoding order:           I1   P4   B2   B3   P6   B5
PTS (in decoding order):  1    4    2    3    6    5
DTS (in decoding order):  1    2    3    4    5    6

When the receiver decodes the bitstream, the frames come out in decoding order, which is clearly not the display order; they must be reordered by PTS before being displayed, as the sketch below illustrates.
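A minimal sketch of this reordering step, assuming a hypothetical Frame type that carries a PTS: frames are fed in decoding order (the example above) and come out in display order.

import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ReorderByPts {

    // A decoded frame with its presentation timestamp (hypothetical type).
    static class Frame {
        final String type;
        final int pts;
        Frame(String type, int pts) { this.type = type; this.pts = pts; }
    }

    public static void main(String[] args) {
        // Frames as they leave the decoder, i.e. in decoding (DTS) order.
        List<Frame> decodingOrder = List.of(
                new Frame("I", 1), new Frame("P", 4), new Frame("B", 2),
                new Frame("B", 3), new Frame("P", 6), new Frame("B", 5));

        // Reorder by PTS before display. A real player only buffers as many
        // frames as needed; sorting the whole group here just shows the idea.
        PriorityQueue<Frame> reorderBuffer =
                new PriorityQueue<>(Comparator.comparingInt((Frame f) -> f.pts));
        reorderBuffer.addAll(decodingOrder);

        while (!reorderBuffer.isEmpty()) {
            Frame f = reorderBuffer.poll();
            System.out.println("display " + f.type + f.pts);
        }
    }
}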

Audio and Video Synchronization#

Let's briefly go over the playback pipeline. The microphone and camera capture data, audio and video are encoded separately and then multiplexed, i.e. packaged together into a media file. On the receiving side the file is demultiplexed to separate the audio and video streams, which are then decoded and played independently. Because the two streams are played at different rates, they can drift out of sync. The two key rates are:

  • Audio: Sample rate
  • Video: Frame rate

Sound cards and graphics cards generally consume data one frame at a time, so the playback duration of each audio frame and each video frame must be calculated. Using the same example as before:

image

From the earlier calculation, each frame of an MP3 file with a sample rate of 44.1 kHz lasts about 26 ms. If the video frame rate is 30 fps, each video frame lasts 1000 / 30 ≈ 33 ms. Ideally, if playback followed these calculated durations exactly, audio and video could be considered in sync.

In reality, audio and video drift out of sync for various reasons: the decoding and rendering time differs from frame to frame (a frame with rich, varied content takes longer to decode and render than a frame of flat, uniform color), and calculation errors accumulate. There are three main approaches to audio and video synchronization:

  • Video synchronization to audio
  • Audio synchronization to video
  • Audio and video synchronization to an external clock

Usually video is synchronized to the audio clock, because human hearing is more sensitive to delay and stuttering than human vision, so audio output should be kept as smooth as possible. Synchronization tolerates a small amount of drift, as long as it stays within an acceptable range, and it works like a feedback loop: when video falls behind audio, video playback is sped up, frames can be dropped, or any accumulated delay can be shortened to catch up; conversely, when video runs ahead of audio, video playback is slowed down, as sketched below.
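Below is a very simplified sketch of this feedback idea; the class name, method, and the 40 ms threshold are all invented for illustration. Before rendering each video frame, its PTS is compared with the current audio clock, and the frame is dropped, rendered immediately, or delayed.

public class AvSyncSketch {

    // Acceptable drift between a video frame's PTS and the audio clock, in milliseconds.
    static final long SYNC_THRESHOLD_MS = 40;

    /**
     * Decide what to do with the next video frame.
     *
     * @param videoPtsMs   PTS of the next decoded video frame, in ms
     * @param audioClockMs PTS of the audio sample currently being played, in ms
     * @return how long to wait before rendering, or -1 to drop the frame
     */
    static long syncVideoToAudio(long videoPtsMs, long audioClockMs) {
        long diff = videoPtsMs - audioClockMs;
        if (diff < -SYNC_THRESHOLD_MS) {
            return -1;   // video is far behind audio: drop this frame to catch up
        } else if (diff > SYNC_THRESHOLD_MS) {
            return diff; // video is ahead of audio: wait before rendering
        }
        return 0;        // within the threshold: render immediately
    }

    public static void main(String[] args) {
        System.out.println(syncVideoToAudio(1000, 1100)); // -1: drop
        System.out.println(syncVideoToAudio(1000, 1010)); //  0: render now
        System.out.println(syncVideoToAudio(1000, 900));  // 100: wait 100 ms
    }
}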
