Today we will go over some basic audio and video knowledge that comes up in everyday development work. For example, in our current work we use TSPlayer, IjkPlayer, and MediaPlayer to provide playback capability. Regardless of which player is used, the upper-level calls are largely similar; what differs are the underlying implementations and the capabilities they support. To go deeper, you have to study audio and video properly. The main directions in Android development include applications, the Framework, audio and video, the NDK, and so on; anyone who stays in the Android field will eventually touch these areas. The main content is as follows:
- Video Encoding
- Audio Encoding
- Multimedia Playback Components
- Frame Rate
- Resolution
- Encoding Format
- Container Format
- Bitrate
- Color Space
- Sampling Rate
- Quantization Precision
- Channels
Video Encoding#
Video encoding refers to the method of converting a video file format into another video format using specific compression techniques. The main coding and decoding standards in video transmission are as follows:
- Motion JPEG (M-JPEG)
- M-JPEG is an image compression coding standard. The JPEG standard is designed for still images, whereas M-JPEG treats a moving video sequence as a series of continuous still images and compresses each frame completely and independently. This allows frames to be stored randomly during editing and enables frame-accurate editing, but because M-JPEG removes only the spatial redundancy within each frame and not the temporal redundancy between frames, its compression efficiency is low.
- The MPEG series of standards by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO)
- The MPEG standards mainly include five: MPEG-1, MPEG-2, MPEG-4, MPEG-7, and MPEG-21. The video compression coding technology of the MPEG standards primarily utilizes inter-frame compression coding technology with motion compensation to reduce temporal redundancy, uses DCT technology to reduce spatial redundancy, and employs entropy coding to reduce statistical redundancy in information representation. The combined use of these technologies greatly enhances compression performance.
- H.261, H.263, H.264, etc., by the International Telecommunication Union (ITU-T)
- H.261: The first practical digital video coding standard. It uses a compression algorithm that combines motion-compensated inter-frame prediction with block DCT. Its motion compensation uses full-pixel accuracy plus a loop filter, and it supports the CIF and QCIF resolutions.
- H.263: H.263 uses the same basic coding algorithm as H.261 with a number of improvements, which lets it provide better image quality at lower bitrates. Its motion compensation uses half-pixel accuracy, and it supports five resolutions: SQCIF, QCIF, CIF, 4CIF, and 16CIF.
- H.264: H.264 is a digital video coding standard jointly developed by ISO and ITU-T through the Joint Video Team (JVT). It is therefore both ITU-T's H.264 and Part 10 of ISO/IEC's MPEG-4, Advanced Video Coding (AVC); whether it is called MPEG-4 AVC, MPEG-4 Part 10, or ISO/IEC 14496-10, it all refers to H.264. H.264 is a hybrid coding system built on the traditional framework with local optimizations, focusing on coding efficiency and reliability. It delivers high-quality, smooth images at a high compression ratio, and H.264-compressed video needs less bandwidth for network transmission, making it one of the most efficient and most widely deployed video compression standards. A small sketch of creating an H.264 decoder on Android follows this list.
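On Android these codec standards are exposed through MediaCodec by MIME type, with H.264 identified as "video/avc". Below is a minimal, hedged sketch of creating and configuring an H.264 decoder; the width and height are placeholders for the real stream parameters, and a real player would pass its rendering Surface into configure().

```kotlin
import android.media.MediaCodec
import android.media.MediaFormat

// Minimal sketch: create an H.264 (video/avc) decoder with MediaCodec.
// width/height are placeholders for the actual stream dimensions.
fun createAvcDecoder(width: Int, height: Int): MediaCodec {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height)
    val codec = MediaCodec.createDecoderByType(MediaFormat.MIMETYPE_VIDEO_AVC)
    // Passing null instead of a Surface keeps the sketch self-contained;
    // a real player would supply the Surface it renders to.
    codec.configure(format, null, null, 0)
    return codec
}
```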
Audio Encoding#
Common audio coding and decoding standards are as follows:
- ITU: G.711, G.729, etc.
- MPEG: MP3, AAC, etc.
- 3GPP: AMR, AMR-WB, AMR-WB+, etc.
- There are also standards set by companies, such as Dolby AC-3, DTS, WMA, etc.
Brief introductions to the common formats are as follows:
- MP3 (MPEG-1 Audio Layer 3): An audio compression technology designed to drastically reduce the amount of audio data. Using MPEG Audio Layer 3 technology, music can be compressed to roughly 1/10 or even 1/12 of its original size while, for most listeners, the playback quality does not noticeably degrade compared with the uncompressed original. It exploits the human ear's insensitivity to high-frequency sound: the time-domain waveform is transformed into the frequency domain and split into multiple frequency bands, and different compression ratios are applied to different bands, with higher compression (even discarding the signal) at high frequencies and lower compression at low frequencies to avoid audible distortion. In effect, high-frequency sound that the ear cannot hear is discarded while the audible lower-frequency parts are kept, which compresses the audio. MP3 is a lossy compression format.
- AAC: Advanced Audio Coding, originally based on MPEG-2 audio coding technology. After MPEG-4 appeared, AAC was re-specified with its characteristics and gained SBR and PS technology; to distinguish it from traditional MPEG-2 AAC, it is also known as MPEG-4 AAC. AAC is a compression format designed specifically for audio data, and compared with MP3 it offers better sound quality at smaller file sizes. AAC is still a lossy format, however, and with the arrival of high-capacity devices its advantages are diminishing (see the encoder sketch after this list).
- WMA: Windows Media Audio, developed by Microsoft, refers to a series of audio codecs and the corresponding digital audio coding format. WMA includes four distinct codecs: WMA, the original codec, which competes with MP3 and RealAudio; WMA Pro, which supports more channels and higher-quality audio; WMA Lossless, a lossless codec; and WMA Voice, which stores speech at low bitrates. Some audio-only ASF files encoded with Windows Media Audio also use the WMA extension. A notable characteristic is support for encryption, so illegally copied files cannot be played locally. WMA is also a lossy compression format.
For more audio and video codec standards, see: Audio Coding Standards
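As a concrete example of one of these audio formats in practice, here is a hedged sketch of configuring an AAC-LC encoder with Android's MediaCodec; the 44.1 kHz / stereo / 128 kbps values are illustrative choices, not requirements from the text above.

```kotlin
import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Minimal sketch: configure an AAC-LC encoder for 44.1 kHz stereo PCM input.
// The 128 kbps target bitrate is an arbitrary illustrative choice.
fun createAacEncoder(): MediaCodec {
    val format = MediaFormat.createAudioFormat(MediaFormat.MIMETYPE_AUDIO_AAC, 44100, 2)
    format.setInteger(MediaFormat.KEY_AAC_PROFILE, MediaCodecInfo.CodecProfileLevel.AACObjectLC)
    format.setInteger(MediaFormat.KEY_BIT_RATE, 128_000)
    val codec = MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_AUDIO_AAC)
    codec.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    return codec
}
```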
Multimedia Playback Components#
Android multimedia playback components include MediaPlayer, MediaCodec, OMX, StageFright, AudioTrack, etc., as follows:
- MediaPlayer: Provides playback control interfaces for the application layer.
- MediaCodec: Provides access to the underlying media codec interfaces.
- OpenMAX: Open Media Acceleration, abbreviated as OMX, is an open multimedia acceleration layer and a multimedia application standard. Android's main multimedia engine, StageFright, uses OpenMax through IBinder for encoding and decoding processing.
- StageFright: Introduced in Android 2.2 to replace the preset media playback engine OpenCORE, StageFright is a media playback engine located at the Native layer, built-in with software-based codecs suitable for popular media formats. Its encoding and decoding functions utilize the OpenMAX framework, incorporating the omx-component part of OpenCORE, existing in Android as a shared library corresponding to libstagefright.so.
- AudioTrack: Manages and plays a single audio resource. It supports only PCM streams; most WAV files, for example, contain raw PCM data that AudioTrack can play directly (a minimal playback sketch follows this list).
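As an illustration of the lowest-level playback path mentioned above, here is a hedged sketch of feeding raw PCM to AudioTrack using the Builder API (available from API 23); the 44.1 kHz mono 16-bit parameters are assumptions for the example, not requirements.

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioTrack

// Minimal sketch: play a buffer of 16-bit mono PCM at 44.1 kHz with AudioTrack.
fun playPcm(pcmData: ByteArray) {
    val sampleRate = 44100
    val minBuf = AudioTrack.getMinBufferSize(
        sampleRate, AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT
    )
    val track = AudioTrack.Builder()
        .setAudioAttributes(
            AudioAttributes.Builder()
                .setUsage(AudioAttributes.USAGE_MEDIA)
                .setContentType(AudioAttributes.CONTENT_TYPE_MUSIC)
                .build()
        )
        .setAudioFormat(
            AudioFormat.Builder()
                .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
                .setSampleRate(sampleRate)
                .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
                .build()
        )
        .setBufferSizeInBytes(minBuf)
        .setTransferMode(AudioTrack.MODE_STREAM)
        .build()

    track.play()
    track.write(pcmData, 0, pcmData.size)  // stream the PCM bytes to the audio sink
    track.stop()
    track.release()
}
```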
Common Multimedia Frameworks and Solutions#
Common multimedia frameworks and solutions include VLC, FFmpeg, GStreamer, etc., as follows:
- VLC: VideoLAN Client, a free, open-source cross-platform multimedia player and framework.
- FFmpeg: A multimedia solution, not a multimedia framework, widely used in audio and video development.
- GStreamer: An open-source multimedia framework for building streaming media applications.
Frame Rate#
Frame rate is a measure of how many frames are displayed per second, expressed in frames per second (FPS) or hertz (Hz); it describes how many times per second the graphics processor can produce a new frame. A higher frame rate gives smoother, more realistic animation. Generally 30 fps is acceptable, and raising it to 60 fps noticeably improves interactivity and realism, but beyond roughly 75 fps the gain in perceived smoothness is negligible. If the frame rate exceeds the screen's refresh rate, the extra frames only waste graphics processing power, because the monitor cannot update that fast.
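A quick arithmetic check of the per-frame time budget behind these numbers (simple standard arithmetic, not from the original text):

$$
t_{\text{frame}} = \frac{1}{\text{fps}}, \qquad
\frac{1}{30\ \text{fps}} \approx 33.3\ \text{ms}, \qquad
\frac{1}{60\ \text{fps}} \approx 16.7\ \text{ms}
$$

So at 60 fps each frame must be produced in roughly 16.7 ms for the animation to stay smooth.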
Resolution#
Video resolution is the pixel dimensions of the image produced by a video device. What do common labels such as 1080p and 4K actually mean? The "p" stands for progressive scan, and the number before it is the count of pixel rows, so 1080p has 1080 rows of pixels; the "K" refers to the approximate number of pixel columns, so 4K has roughly 4000 columns. In practice, 1080p generally refers to a resolution of 1920 x 1080, while 4K refers to 3840 x 2160.
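The pixel counts make the difference concrete (simple arithmetic, not from the original text):

$$
1920 \times 1080 = 2{,}073{,}600 \ \text{pixels}, \qquad
3840 \times 2160 = 8{,}294{,}400 \ \text{pixels} \approx 4 \times 1080\text{p}
$$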
Refresh Rate#
Refresh rate is the number of times per second that the screen image is redrawn. It is divided into vertical refresh rate and horizontal refresh rate; the figure usually quoted is the vertical refresh rate, i.e. how many times per second the screen is refreshed, measured in Hz (hertz). The higher the refresh rate, the more stable the image, the clearer the display, and the less strain on the eyes; a low refresh rate makes flicker and jitter more severe and tires the eyes faster. Generally, a refresh rate above 80 Hz eliminates visible flicker and jitter, so the eyes do not fatigue as easily.
Encoding Format#
For audio and video, the encoding format corresponds to audio encoding and video encoding. Referring to the previous audio and video encoding standards, each encoding standard corresponds to a specific encoding algorithm, aiming to achieve data compression and reduce data redundancy through certain encoding algorithms.
Container Format#
According to Baidu Baike's introduction, a container format (also called a wrapper) is a file format that packages already encoded and compressed video and audio tracks according to a particular specification. In other words, it is just a shell; you can think of it as a folder that holds the video and audio tracks. To put it plainly: the video track is the rice, the audio track is the dish, and the container format is the bowl or pot that holds them.
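On Android, the container's role as a mere wrapper is easy to see with MediaExtractor, which opens a container file and lists the encoded tracks packed inside it. A hedged sketch (the file path is a placeholder):

```kotlin
import android.media.MediaExtractor
import android.media.MediaFormat

// Minimal sketch: print the MIME type of every track packed inside a container file.
fun dumpTracks(path: String) {
    val extractor = MediaExtractor()
    extractor.setDataSource(path)                 // e.g. an .mp4 or .mkv file
    for (i in 0 until extractor.trackCount) {
        val format = extractor.getTrackFormat(i)
        val mime = format.getString(MediaFormat.KEY_MIME)
        println("track $i: $mime")                // e.g. video/avc, audio/mp4a-latm
    }
    extractor.release()
}
```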
Bitrate#
Bitrate, also known as bit rate, is the number of bits transmitted or processed per unit of time, measured in bps (bits per second, b/s). A higher bitrate means more data is transferred per unit of time. In the multimedia industry, the data rate of audio or video over time is usually quoted in kbps. For example, a 1 Mb/s broadband connection delivers at most about 125 KB/s, so it can only stream video whose bitrate stays within roughly that bandwidth; higher-bitrate video plays smoothly only after buffering.
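Bitrate also determines file size. As a rough back-of-the-envelope relation (standard arithmetic, not from the original text), size equals bitrate times duration divided by 8; for a 90-minute video encoded at 2 Mb/s:

$$
\text{size} \approx \frac{\text{bitrate} \times \text{duration}}{8}
= \frac{2{,}000{,}000\ \text{b/s} \times 5400\ \text{s}}{8}
= 1.35 \times 10^{9}\ \text{bytes} \approx 1.35\ \text{GB}
$$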
Bitrate is generally divided into fixed bitrate and variable bitrate:
- Fixed bitrate guarantees a constant bitrate for the stream but sacrifices video quality. For instance, to maintain a constant bitrate, some rich content may lose certain image details and become blurry.
- Variable bitrate means the output stream's bitrate is variable because the peak information of the video source itself changes. From the perspective of ensuring video transmission quality and fully utilizing information, variable bitrate video encoding is the most reasonable.
Bitrate is roughly proportional to both video quality and file size, but beyond a certain point raising the bitrate no longer produces a visible improvement in quality.
Color Space#
- YUV: A color encoding method commonly used in image-processing pipelines. YUV takes human perception into account when encoding photos or videos, which allows the chroma bandwidth to be reduced. Y represents luminance (brightness), while U and V represent the chrominance (color difference) components. The ranges referred to by Y′UV, YUV, YCbCr, and YPbPr are often confused or overlapping. Historically, YUV and Y′UV were used to encode analog television signals, while YCbCr describes digital image signals and is suited to video and image compression and transmission, such as MPEG and JPEG. Nowadays the term YUV is used broadly for these encodings in computer systems.
- RGB: The primary color light model, also known as the RGB color model or red-green-blue color model, is an additive color model that combines red (R), green (G), and blue (B) light in different proportions to produce various colors. Most modern displays adopt the RGB color standard.
YUV is mainly used to optimize the transmission of color video signals while remaining backward compatible with older black-and-white televisions. Compared with transmitting RGB video signals, its greatest advantage is that it occupies less bandwidth, because the chrominance components can be subsampled.
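To make the relationship between the two color spaces concrete, here is a hedged sketch of converting a single 8-bit YUV pixel to RGB, assuming BT.601 full-range YCbCr with the chroma components offset by 128; the function name is just for illustration.

```kotlin
// Minimal sketch: convert one 8-bit YUV (BT.601 full-range YCbCr) pixel to RGB.
fun yuvToRgb(y: Int, u: Int, v: Int): Triple<Int, Int, Int> {
    val d = (u - 128).toFloat()   // chroma components are stored offset by 128
    val e = (v - 128).toFloat()
    val r = (y + 1.402f * e).toInt().coerceIn(0, 255)
    val g = (y - 0.344f * d - 0.714f * e).toInt().coerceIn(0, 255)
    val b = (y + 1.772f * d).toInt().coerceIn(0, 255)
    return Triple(r, g, b)
}
```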
Sampling Rate#
The sampling rate is the number of samples per second taken from a continuous signal to form a discrete signal, measured in hertz (Hz); it is the sampling frequency used when converting an analog signal into a digital one. The human ear can generally hear sounds between 20 Hz and 20 kHz. According to the Nyquist sampling theorem, when the sampling frequency is greater than twice the highest frequency in the signal, the resulting digital signal can faithfully represent the original. Common sampling rates are as follows:
- 8000 Hz: Sampling rate used for telephones, sufficient for human speech.
- 11025 Hz: Sampling rate used for AM (amplitude modulation) radio broadcasting.
- 22050 Hz and 24000 Hz: Sampling rates used for FM (frequency modulation) radio broadcasting.
- 44100 Hz: Audio CD, commonly used sampling rate for MPEG-1 audio (VCD, SVCD, MP3).
- 47250 Hz: Sampling rate used for commercial PCM recorders.
- 48000 Hz: Sampling rate used for miniDV, digital television, DVD, DAT, movies, and professional audio.
The standard sampling frequency for CD audio is 44.1 kHz, which is also the most common rate used by sound cards and computer audio processing. Blu-ray audio can use sampling rates as high as 192 kHz. Most sound cards support 44.1 kHz, 48 kHz, and 96 kHz, while high-end products support 192 kHz or even higher. In short, the higher the sampling rate, the better the resulting audio quality, but the more storage it occupies (a minimal 44.1 kHz capture sketch follows).
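As a concrete use of the 44.1 kHz rate, here is a hedged sketch of capturing about one second of 16-bit mono PCM with AudioRecord; it assumes the RECORD_AUDIO permission has already been granted, and the buffer handling is deliberately simplified.

```kotlin
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

// Minimal sketch: capture roughly one second of 16-bit mono PCM at 44.1 kHz.
// Assumes the RECORD_AUDIO permission has already been granted.
fun recordOneSecond(): ShortArray {
    val sampleRate = 44100
    val minBuf = AudioRecord.getMinBufferSize(
        sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
    )
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC, sampleRate,
        AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, minBuf
    )
    val samples = ShortArray(sampleRate)        // one second of samples at 44.1 kHz
    recorder.startRecording()
    recorder.read(samples, 0, samples.size)     // blocking read of PCM samples
    recorder.stop()
    recorder.release()
    return samples
}
```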
Quantization Precision#
Converting sound waves into a digital signal is affected not only by the sampling rate but also by another important factor: quantization precision. Where the sampling frequency concerns how many samples are taken per second, quantization precision concerns how finely the amplitude of the sound wave is divided: with n bits, the amplitude range is divided into 2^n discrete levels, and this bit depth is also called the audio resolution.
In addition, the bit count determines the range of wave amplitudes that can be represented (the dynamic range, i.e. the difference between the loudest and the quietest volume). More bits allow a wider range of values and therefore a more precise description of the waveform. Each bit contributes roughly 6 dB of dynamic range, so 16-bit audio provides a maximum dynamic range of about 96 dB (around 92 dB once dither is applied), and by the same reasoning 20-bit audio reaches about 120 dB. A large dynamic range is desirable: it is the ratio between the maximum undistorted output and the system's noise floor, and the larger this value, the wider the range of volume the system can reproduce.
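The roughly 6 dB-per-bit figure follows from the standard relation between bit depth and dynamic range (a textbook formula, not specific to this article):

$$
\text{DR} \approx 20 \log_{10}\!\left(2^{n}\right) \approx 6.02\,n \ \text{dB}, \qquad
n = 16 \Rightarrow \approx 96\ \text{dB}, \qquad
n = 20 \Rightarrow \approx 120\ \text{dB}
$$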
Channels#
Channels are the independent audio signals captured or played back at different spatial positions during recording or playback, so the channel count is the number of sound sources at recording time or the number of speakers at playback time. Common configurations include mono, stereo, 4 channels, 5.1 channels, and 7.1 channels (a sketch mapping these layouts to Android channel masks follows this list):
- Mono: Sound is reproduced through a single speaker.
- Stereo: Expands a mono speaker into two symmetrically positioned speakers, with sound allocated to two independent channels during recording, achieving excellent sound localization. This technology is particularly useful in music appreciation, allowing listeners to clearly distinguish the origins of various instruments, making the music more imaginative and closer to a live experience. Stereo technology has been widely applied in many sound cards since the Sound Blaster Pro, becoming a far-reaching audio standard.
- 4 Channels: The 4-channel surround system specifies 4 sound points: front left, front right, rear left, and rear right, with listeners surrounded in the middle. It is also recommended to add a subwoofer to enhance the playback of low-frequency signals, which is why the 4.1 channel speaker system is widely popular today. Overall, the 4-channel system can provide listeners with surround sound from multiple directions, offering a new experience.
- 5.1 Channels: The 5.1-channel system grows out of the 4.1-channel system, splitting the surround into left surround and right surround and adding a center unit, with the ".1" subwoofer channel carrying the low-frequency effects.
- 7.1 Channels: The 7.1-channel system adds two more sound points, center-left and center-right, to the 5.1 layout, essentially building a balanced sound field around the listener and strengthening the rear sound field.
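For reference, these speaker layouts correspond to Android's output channel masks roughly as below; the mapping is illustrative only (CHANNEL_OUT_7POINT1_SURROUND requires API 23+), and the helper function is hypothetical.

```kotlin
import android.media.AudioFormat

// Illustrative mapping from speaker count to Android output channel masks.
fun channelMaskFor(channels: Int): Int = when (channels) {
    1 -> AudioFormat.CHANNEL_OUT_MONO               // mono
    2 -> AudioFormat.CHANNEL_OUT_STEREO             // stereo
    4 -> AudioFormat.CHANNEL_OUT_QUAD               // 4-channel surround
    6 -> AudioFormat.CHANNEL_OUT_5POINT1            // 5.1 = 5 speakers + subwoofer
    8 -> AudioFormat.CHANNEL_OUT_7POINT1_SURROUND   // 7.1 (API 23+)
    else -> AudioFormat.CHANNEL_OUT_DEFAULT
}
```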