To build up the relevant audio and video background, you can first read the following articles:
- Audio and Video Development Basics
- Audio Frames, Video Frames, and Synchronization
- Camera2, MediaCodec Recording mp4
- Detailed Explanation of Android Native Encoding and Decoding Interface MediaCodec
This article summarizes the basics of audio and covers the following aspects:
- Generation of sound
- Three elements of sound
- Analog-to-digital conversion
- Raw audio data
- PCM and WAV
- Audio processing flow
## Generation of sound
Sound is produced by vibrating objects and travels as sound waves through media such as air, solids, and liquids. The human ear can perceive sound waves between 20 Hz and 20 kHz, known as audible sound. By frequency, sound waves are mainly divided into:
- Audible sound waves: 20 Hz ~ 20 kHz
- Ultrasound: > 20 kHz
- Infrasound: < 20 Hz
In addition, the human voice generally falls in the range of 85 Hz ~ 1100 Hz.
## Three elements of sound
The three elements of sound are pitch, volume, and timbre:
- Pitch: determined by the frequency of the sound wave, it describes how high or low a sound is perceived to be. An object that vibrates quickly produces a high-pitched sound; one that vibrates slowly produces a low-pitched sound.
- Volume: also called sound intensity or loudness, it is determined by the amplitude of the sound wave and describes the human ear's subjective perception of how loud a sound is.
- Timbre: also called tone quality, it refers to the distinctive waveform characteristics that sounds from different sources always carry; different objects vibrate in different ways, giving each source its unique sound. Timbre is determined mainly by harmonics: a real-world sound is not a pure sine wave but a fundamental tone plus its harmonics.
## Analog-to-digital conversion
Sound is an analog signal. To digitize it, the analog audio signal must be converted into a digital signal. This is called analog-to-digital conversion, and its main steps are sampling, quantization, and encoding, as shown in the following figure:
- Sampling: the process of converting a continuous signal into a discrete one by taking the signal's value at fixed intervals; the number of samples taken per second is the sample rate. For example, 8 kHz is the sample rate of telephone audio, which is sufficient for speech; audio CDs generally use 44.1 kHz, and digital TV generally uses 48 kHz. The higher the sample rate, the more faithfully the sound is reproduced.
- Quantization: the process of mapping each sampled value to one of a finite set of levels. Quantization can be uniform or non-uniform; the figure above uses uniform quantization with 8 levels (see the sketch after this list).
- Encoding: the process of converting the quantized values into binary codes; the simplest scheme is natural binary code, and other coding schemes exist if you are interested. The encoding here refers to source coding; channel coding is a separate topic.
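To make the three steps concrete, here is a minimal sketch in plain Kotlin (no library dependencies; the tone and values are illustrative) that samples a 440 Hz sine wave at 8 kHz, uniformly quantizes each sample to 8 levels as in the figure, and encodes each level as a 3-bit natural binary code:

```kotlin
import kotlin.math.PI
import kotlin.math.roundToInt
import kotlin.math.sin

fun main() {
    val sampleRate = 8_000      // samples per second (8 kHz)
    val frequency = 440.0       // tone frequency in Hz
    val levels = 8              // 8 quantization levels = 3 bits

    for (n in 0 until 8) {      // first 8 samples = 1 ms of audio
        // Sampling: evaluate the continuous signal at discrete instants.
        val t = n.toDouble() / sampleRate
        val analog = sin(2 * PI * frequency * t)            // in [-1.0, 1.0]

        // Quantization: snap the value to the nearest of the 8 levels.
        val level = ((analog + 1.0) / 2.0 * (levels - 1)).roundToInt()

        // Encoding: write the level as a 3-bit natural binary code.
        val code = level.toString(2).padStart(3, '0')
        println("n=$n  analog=${"%+.3f".format(analog)}  level=$level  code=$code")
    }
}
```

Real audio uses far finer quantization, of course: 16-bit samples give 2^16 = 65536 levels instead of 8.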
## Raw audio data
PCM (Pulse Code Modulation) is the technique that converts an analog audio signal into a digital one. In the audio and video field, PCM refers to uncompressed audio sample data: the raw audio produced by sampling, quantizing, and encoding the analog signal. The key attributes of PCM data are as follows:
- Sample Size: the number of bits used to store each sample, which determines the number of quantization levels (2^bits). 16 bits is the most common size.
- Sample Rate: the number of samples taken per second, in Hz. Common sample rates are 8 kHz, 16 kHz, 32 kHz, 44.1 kHz, 48 kHz, etc.
- Number of Channels: the number of channels in the current PCM data, such as mono, stereo, multi-channel, etc.
- Byte Ordering: the byte order in which PCM data is stored, either big-endian or little-endian. Usually, little-endian is used for efficient data processing.
- Sign: whether the samples are stored as signed or unsigned values.
- Integer or Floating Point: whether the samples are represented as integers or floating-point values.
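To see how these attributes show up in practice, here is a minimal sketch (the function name is hypothetical) that packs normalized float samples into the most common PCM layout: 16-bit, signed, integer, little-endian:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Pack normalized float samples in [-1.0, 1.0] into 16-bit signed
// little-endian PCM, two bytes per sample.
fun floatsToPcm16le(samples: FloatArray): ByteArray {
    val buffer = ByteBuffer.allocate(samples.size * 2)
        .order(ByteOrder.LITTLE_ENDIAN)                  // byte ordering
    for (s in samples) {
        val clamped = s.coerceIn(-1.0f, 1.0f)            // avoid overflow
        buffer.putShort((clamped * Short.MAX_VALUE).toInt().toShort())
    }
    return buffer.array()
}
```

This is also the layout Android's ENCODING_PCM_16BIT delivers on typical (little-endian) devices.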
Given these attributes, how is the bitrate calculated? The bitrate is the amount of audio data produced per second:
Bitrate = Sample Rate × Sample Size × Number of Channels
For example:
For a PCM-encoded WAV file with a sample rate of 44.1 kHz, a sample size of 16 bits, and two (stereo) channels, the bitrate is 44.1k × 16 × 2 = 1411.2 kbit/s. Transmitting such audio means moving more than 1 Mbit (about 176 KB) of data every second; since upload bandwidth is usually far lower than download bandwidth, audio data needs to be compressed.
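The same arithmetic in code, as a sanity check (the helper is illustrative, not a library API):

```kotlin
// Bitrate in bits per second = sample rate × sample size × channels.
fun pcmBitrate(sampleRate: Int, sampleSizeBits: Int, channels: Int): Long =
    sampleRate.toLong() * sampleSizeBits * channels

fun main() {
    val bps = pcmBitrate(sampleRate = 44_100, sampleSizeBits = 16, channels = 2)
    println("${bps / 1000.0} kbit/s")                  // 1411.2 kbit/s
    println("${"%.1f".format(bps / 8 / 1000.0)} KB/s") // 176.4 KB/s
}
```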
## PCM and WAV
PCM was covered in the previous section. WAV is an audio file format commonly used for lossless audio; it does not mandate a particular encoding. The audio inside a WAV file is usually PCM, but other encodings, such as MP3, are also possible. In summary:
- PCM: an encoding method; in the audio and video field it refers to the raw audio data stream.
- WAV: an audio file format that can contain PCM data; in the common case it is simply PCM data preceded by a WAV header.
Finally, here is a schematic diagram of the WAV header:
More details will be added after further research.
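In the meantime, the standard 44-byte header for PCM WAV can be sketched in code. The field layout below is the canonical RIFF/WAVE format for uncompressed PCM; the function name is illustrative:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Build the canonical 44-byte RIFF/WAVE header for uncompressed PCM data.
// All multi-byte integer fields are little-endian.
fun wavHeader(dataSize: Int, sampleRate: Int, channels: Int, bitsPerSample: Int): ByteArray {
    val byteRate = sampleRate * channels * bitsPerSample / 8
    val blockAlign = channels * bitsPerSample / 8
    return ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN).apply {
        put("RIFF".toByteArray())         // chunk ID
        putInt(36 + dataSize)             // chunk size = rest of header + data
        put("WAVE".toByteArray())         // format
        put("fmt ".toByteArray())         // subchunk 1 ID (note the trailing space)
        putInt(16)                        // subchunk 1 size for PCM
        putShort(1)                       // audio format: 1 = PCM (uncompressed)
        putShort(channels.toShort())      // number of channels
        putInt(sampleRate)                // sample rate
        putInt(byteRate)                  // byte rate = sampleRate * channels * bits / 8
        putShort(blockAlign.toShort())    // block align = channels * bits / 8
        putShort(bitsPerSample.toShort()) // bits per sample
        put("data".toByteArray())         // subchunk 2 ID
        putInt(dataSize)                  // size of the PCM payload in bytes
    }.array()
}
```

Writing `wavHeader(pcm.size, 44100, 2, 16)` followed by the raw PCM bytes yields a playable WAV file.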
## Audio Processing Flow
Let's briefly walk through the audio processing flow. First, producing an audio file: on Android, the audio captured with AudioRecord is raw PCM data, i.e., a digital audio signal (MediaRecorder, by contrast, encodes and packages the audio into a file directly). The PCM stream is then compressed by an encoder to produce the corresponding audio file. Second, playing an audio file: the file is demultiplexed and decoded back to PCM, which is then played.
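For reference, a minimal AudioRecord capture loop might look like the following sketch (it assumes the RECORD_AUDIO permission has been granted; the fixed read count and the callback are stand-ins for a real stop condition and a real encoder):

```kotlin
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

// Capture raw PCM from the microphone with AudioRecord.
fun capturePcm(onPcm: (ByteArray, Int) -> Unit) {
    val sampleRate = 44_100
    val minBuf = AudioRecord.getMinBufferSize(
        sampleRate,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT
    )
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC,
        sampleRate,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,   // 16-bit signed PCM samples
        minBuf
    )
    val buffer = ByteArray(minBuf)
    recorder.startRecording()
    repeat(100) {                          // read a fixed number of buffers for brevity
        val read = recorder.read(buffer, 0, buffer.size)
        if (read > 0) onPcm(buffer, read)  // hand the raw PCM to the caller (e.g., an encoder)
    }
    recorder.stop()
    recorder.release()
}
```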