A Perceptual Audio Coder (and decoder)

ABSTRACT
A perceptual audio coder was written in Matlab that applies simple masking curves to determine the bit allocation for quantizing MDCT coefficients. This coder was determined to be transparent at 256kbps and nearly transparent at 128kbps, depending on the material. The perceptual coder was also compared to a least-squared-error (LSE) coder and was found to be significantly worse. Several of the artifacts that degrade the quality of the perceptual coder are well known and will be addressed in upcoming versions of this coder.

Table of Contents
1.     Introduction
     1.1     Audio Coding
     1.2     Masking
2.     Design
     2.1     Frames
     2.2     Signal-to-Mask Ratio
     2.3     Bit Allocation
     2.4     Quantization
     2.5     Reading/Writing the files
3.     Subjective Testing
     3.1     Subjective Difference Grade
     3.2     Comparison to LSE Coder
4.     Conclusions
     4.1     Current Coder
     4.2     Future Improvements

APPENDIX

RESOURCES




1. INTRODUCTION

1.1 Audio Coding
Over the past decade, audio coding has become a hot topic due to the popularity of internet media and improvements in movie technology. Various methods have been proposed to improve low bit-rate audio quality. Some coders, such as Meridian Lossless Packing, WaveArc, Pegasus SPS, Sonarc, WavPack, AudioZip, Monkey, and FLAC, are considered “lossless” because they retain all the original audio data while reducing bit-rate. Other coders are called “lossy” because they throw away portions of the audio stream that cannot be easily heard; many are in common use today, including MP3, WMA, AAC, and PAC. These lossy coders are called perceptual audio coders because they rely on human perception of sound.

1.2 Masking
Many portions of an audio stream cannot actually be heard. Any sound with intensity below a certain threshold (called the threshold in quiet) cannot be heard, due to the limits of the ear’s sensitivity. Sometimes, sounds above the threshold in quiet cannot be heard because other sounds cover them up. This is due to a psychoacoustic phenomenon known as masking. If two separate tones are close enough in frequency, one tone may actually cover up the other one. The tone that is heard is called the masker, and the tone which is not heard is called the maskee.

The above phenomenon is known as simultaneous masking. There is another phenomenon known as temporal masking. Temporal masking is the masking of a sound before or after the masker event occurs.


2. DESIGN
2.1 Frames
The first step in designing an audio coder is to segment the audio stream into frames. A frame is a short section of audio, typically shorter than 50ms. At a sampling frequency of 44.1kHz, a frame of 2048 samples is about 46ms long. Enframing the audio stream allows the engineer to treat each frame as a relatively stationary sound. Frame lengths longer than 50ms are typically not used, since pleasant-sounding audio is non-stationary. Long frames suit audio signals that change relatively slowly, and they also give greater frequency resolution because the FFT length is greater. Shorter frames may be used for more transient sounds, to avoid an artifact known as pre-echo. Pre-echo occurs when the quantization noise of a transient is spread over the entire frame (such as 46ms), so that noise precedes the attack and the attack sounds blurred. Coders with variable frame lengths (block switching) take advantage of both long and short frames by changing the frame length dynamically, depending on the material. The coder presented in this paper uses a fixed frame length of 2048 samples.
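
The frame arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the non-overlapping split is an assumption made for brevity (an MDCT coder normally overlaps adjacent frames by 50%); this is not the Matlab code of this coder.

```python
def frame_duration_ms(frame_length, sample_rate):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_length / sample_rate


def bin_spacing_hz(frame_length, sample_rate):
    """Spacing between adjacent bins of an FFT of the frame, in Hz."""
    return sample_rate / frame_length


def split_into_frames(samples, frame_length):
    """Cut a list of samples into consecutive non-overlapping frames.

    The last frame is zero-padded. Non-overlapping frames are an
    assumption of this sketch, not a statement about the coder.
    """
    frames = []
    for start in range(0, len(samples), frame_length):
        frame = list(samples[start:start + frame_length])
        frame += [0.0] * (frame_length - len(frame))
        frames.append(frame)
    return frames


print(round(frame_duration_ms(2048, 44100), 1))  # 46.4
print(round(bin_spacing_hz(2048, 44100), 1))     # 21.5
```

Halving the frame length halves the duration but doubles the bin spacing, which is the time/frequency trade-off described above.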
2.2 Signal-to-Mask Ratio
There are many ways to calculate the masking threshold. In general, the masking threshold varies depending on the frequency and intensity of the masker signal. The width of the masking curve typically extends farther in the direction of higher frequencies than toward lower frequencies, and the amplitude depends on both the frequency and the tonality of the masker. Noise tends to mask much more than tones do.
In order to calculate the masking threshold, the first step is to calculate the FFT of the frame and find the spectral peaks. To find the peaks, simply search for every point where the slope of the magnitude spectrum changes from positive to negative. Each of these peaks corresponds to an individual frequency component in the signal. The next step is to calculate the approximate SPL of each peak. One way to do this is to normalize such that a full-amplitude tone at 1kHz is equal to 96dB SPL. The SPL of each peak can then be calculated according to the following formula:
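
A minimal Python sketch of the peak search and the 96dB normalization follows. It assumes the magnitude spectrum has already been scaled so that a full-amplitude sinusoid produces a peak amplitude of 1.0; the window and FFT-length scaling used by this coder are not shown, and the function names are hypothetical.

```python
import math


def find_peaks(magnitudes):
    """Indices where the slope changes from positive to negative."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        rising = magnitudes[k] > magnitudes[k - 1]
        falling = magnitudes[k] > magnitudes[k + 1]
        if rising and falling:
            peaks.append(k)
    return peaks


def peak_spl_db(amplitude, full_scale_db=96.0):
    """Approximate SPL of a peak, with amplitude 1.0 mapped to 96 dB.

    Assumes the spectrum is pre-scaled so a full-amplitude sinusoid
    has amplitude 1.0. Tiny amplitudes are floored to avoid log(0).
    """
    return full_scale_db + 20.0 * math.log10(max(amplitude, 1e-12))


spectrum = [0.0, 0.1, 0.5, 0.1, 0.05, 0.2, 1.0, 0.3, 0.0]
for k in find_peaks(spectrum):
    print(k, round(peak_spl_db(spectrum[k]), 1))
# 2 90.0
# 6 96.0
```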

From these SPL values, a masking curve can be created for each peak. There are several masking functions that do this. This coder uses the function suggested by Schroeder:

where dz is the separation between maskee and masker on the Bark frequency scale. The Bark frequency is defined by the following equation:

where f is the frequency in kHz.
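
As a hedged illustration, the commonly cited forms of these two functions are sketched below in Python: the Schroeder spreading function 15.81 + 7.5(dz + 0.474) - 17.5 sqrt(1 + (dz + 0.474)^2) in dB, and the Bark approximation 13 arctan(0.76 f) + 3.5 arctan((f / 7.5)^2) with f in kHz. The constants used by this coder may differ, and `offset_db` is a hypothetical parameter that is not taken from this paper.

```python
import math


def khz_to_bark(f_khz):
    """Bark frequency for f in kHz (commonly cited approximation)."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)


def schroeder_spread_db(dz):
    """Schroeder spreading function in dB; dz = maskee - masker in Bark."""
    x = dz + 0.474
    return 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)


def masking_curve_db(masker_khz, masker_spl_db, maskee_khz, offset_db=0.0):
    """Masking level at maskee_khz caused by a single masker peak.

    offset_db is a hypothetical downward shift of the curve below the
    masker level; the value used by the coder is not stated.
    """
    dz = khz_to_bark(maskee_khz) - khz_to_bark(masker_khz)
    return masker_spl_db + schroeder_spread_db(dz) - offset_db


print(round(khz_to_bark(1.0), 2))
print(round(schroeder_spread_db(2.0), 1), round(schroeder_spread_db(-2.0), 1))
```

The spreading function is close to 0dB at dz = 0 and falls off more slowly for positive dz, which matches the observation that masking extends farther toward higher frequencies.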

The next step is to combine all the masking curves with the Threshold in Quiet (TIQ). The TIQ is the minimum SPL that a person can hear at a given frequency, and is typically defined by the following equation:

where f is the frequency in kHz.
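
A widely used approximation of the threshold in quiet is 3.64 f^(-0.8) - 6.5 exp(-0.6 (f - 3.3)^2) + 0.001 f^4 dB SPL, with f in kHz. The Python sketch below evaluates that form; it is assumed here to be the equation meant above, and the function name is hypothetical.

```python
import math


def threshold_in_quiet_db(f_khz):
    """Approximate threshold in quiet in dB SPL for f in kHz (f > 0)."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


for f in (0.1, 1.0, 3.3, 10.0):
    print(f, round(threshold_in_quiet_db(f), 1))
```

The curve is high at low frequencies, dips to its minimum near 3.3kHz where the ear is most sensitive, and rises steeply again at high frequencies.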

There are a variety of ways to combine the masking curves and the TIQ which correspond to different values of alpha in the equation

In this coder, alpha is zero, which corresponds to using the highest masking curve (or the threshold in quiet). The signal-to-mask ratio (SMR) can then be calculated by dividing the signal intensity by the masking threshold, or equivalently by subtracting the masking threshold from the SPL of the signal when both are expressed in dB.
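
The "highest curve wins" rule and the SMR can be sketched in Python as follows. The sketch takes the maximum directly rather than going through alpha, works in dB (so the ratio becomes a difference), and uses made-up levels and hypothetical function names.

```python
def global_threshold_db(masking_curves_db, tiq_db):
    """Highest of all masking curves and the threshold in quiet, per bin.

    masking_curves_db is a list of curves, one per masker, each a list
    of levels in dB with the same length as tiq_db.
    """
    threshold = list(tiq_db)
    for curve in masking_curves_db:
        threshold = [max(t, c) for t, c in zip(threshold, curve)]
    return threshold


def smr_db(signal_spl_db, threshold_db):
    """Signal-to-mask ratio in dB: signal level minus masking threshold."""
    return [s - t for s, t in zip(signal_spl_db, threshold_db)]


# Illustrative levels only, not measurements from this coder.
tiq = [20.0, 5.0, 3.0, 10.0]
curves = [[30.0, 60.0, 45.0, 20.0], [10.0, 25.0, 50.0, 35.0]]
signal = [40.0, 80.0, 48.0, 30.0]
threshold = global_threshold_db(curves, tiq)
print(threshold)                  # [30.0, 60.0, 50.0, 35.0]
print(smr_db(signal, threshold))  # [10.0, 20.0, -2.0, -5.0]
```

Bins with a negative SMR lie below the masking threshold and need no bits.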

2.3 Bit Allocation
From the SMR, it can be determined which frequency bands should receive the most bits. As a general rule, each bit increases the signal-to-noise ratio by about 6dB. Therefore, allocating a bit for each 6dB of SMR would ensure that quantization noise is below the masking threshold, and thus inaudible. However, there may not be enough bits available to do this, so bits must be allocated where they are needed most. The “water-filling” bit allocation algorithm allocates bits by looking for the maximum value of the SMR, allocating a bit to that subband, subtracting 6dB from the SMR at that frequency, and repeating as long as bits are available to allocate.
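
The water-filling loop described above can be sketched in Python as follows. The per-band cap `max_bits_per_band` is a hypothetical safeguard added for this sketch, and the SMR values are made up.

```python
def allocate_bits(smr_db, total_bits, db_per_bit=6.0, max_bits_per_band=16):
    """Greedy "water-filling" bit allocation over subbands.

    Repeatedly gives one bit to the band with the largest remaining SMR
    and lowers that band's SMR by db_per_bit. max_bits_per_band is a
    hypothetical cap, not taken from the coder.
    """
    remaining = list(smr_db)
    bits = [0] * len(remaining)
    for _ in range(total_bits):
        candidates = [b for b in range(len(bits)) if bits[b] < max_bits_per_band]
        if not candidates:
            break
        # Ties go to the lowest band index.
        band = max(candidates, key=lambda b: remaining[b])
        bits[band] += 1
        remaining[band] -= db_per_bit
    return bits


print(allocate_bits([20.0, 7.0, -3.0, 13.0], total_bits=6))  # [3, 1, 0, 2]
```

The band with a negative SMR receives nothing until every other band has been pushed below it, which is the intended behavior when bits are scarce.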

2.4 Quantization
After determining where bits should be allocated, the next step is to quantize the audio signal to the appropriate number of bits. This audio coder is based on the Modified Discrete Cosine Transform (MDCT), so the MDCT coefficients are quantized. The MDCT of the original time-domain frame must first be computed. The coefficients must then be attenuated, because MDCT coefficients are typically too large to be quantized directly. Therefore, an attenuation factor is chosen equal to the largest magnitude found in the MDCT, reducing the maximum value that needs to be quantized to unity. After attenuation, the coefficients are quantized according to the bit allocation determined earlier.
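
A minimal Python sketch of the attenuate-then-quantize step follows. It assumes a midrise uniform quantizer, which the text does not specify, and it operates on a plain list of coefficients rather than on a real MDCT; all names are hypothetical.

```python
import math


def attenuation_factor(coeffs):
    """Largest magnitude in the frame (1.0 for an all-zero frame)."""
    return max(abs(c) for c in coeffs) or 1.0


def quantize(coeffs, n_bits, gain):
    """Scale by 1/gain and quantize to signed integer codes of n_bits."""
    if n_bits == 0:
        return [0] * len(coeffs)  # band receives no bits
    half = 2 ** (n_bits - 1)
    codes = []
    for c in coeffs:
        code = math.floor(c / gain * half)
        codes.append(min(max(code, -half), half - 1))  # clamp to range
    return codes


def dequantize(codes, n_bits, gain):
    """Reconstruct coefficients at the centre of each quantizer step."""
    if n_bits == 0:
        return [0.0] * len(codes)
    half = 2 ** (n_bits - 1)
    return [(code + 0.5) / half * gain for code in codes]


coeffs = [12.0, -3.0, 0.75, -12.0]
gain = attenuation_factor(coeffs)
codes = quantize(coeffs, 4, gain)
print(codes)                       # [7, -2, 0, -8]
print(dequantize(codes, 4, gain))  # [11.25, -2.25, 0.75, -11.25]
```

With 4 bits the step size is gain / 8 = 1.5, so the reconstruction error never exceeds half a step (0.75) in this example.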

2.5 Reading/Writing the files
Once the MDCT coefficients are quantized, they can be written to a file. In addition to the MDCT coefficients, the gain factor (the attenuation factor from Section 2.4) and the number of bits allocated to each band must also be written. In this coder, a file header is also included, which contains information such as the sampling frequency, frame length, bit rate, number of bits used for writing the gain factor, and the number of frames in the file. Because only a few bits are used to represent the gain factor, the logarithm of the gain is written to the file.
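
One way to store the logarithm of the gain in a few bits is sketched below in Python. The 6-bit width and 1.5dB step are hypothetical values chosen for illustration, and the rounding direction is a design choice of this sketch rather than a description of the file format used by this coder.

```python
import math


def encode_gain(gain, gain_bits=6, step_db=1.5):
    """Map a gain to an integer code on a logarithmic (dB) scale.

    Rounds up so that the decoded gain is not smaller than the true
    gain, keeping scaled coefficients within [-1, 1]. Gains below 1.0
    are stored as code 0, and codes are clamped to gain_bits bits.
    """
    gain_db = 20.0 * math.log10(max(gain, 1.0))
    code = math.ceil(gain_db / step_db)
    return min(code, 2 ** gain_bits - 1)


def decode_gain(code, step_db=1.5):
    """Recover the gain represented by an integer code."""
    return 10.0 ** (code * step_db / 20.0)


code = encode_gain(12.0)
print(code)                         # 15
print(round(decode_gain(code), 2))  # 13.34
```

Six bits at 1.5dB per step span 94.5dB, so a wide range of gains fits in a field far smaller than a floating-point value.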






3. SUBJECTIVE TESTING
3.1 Subjective Difference Grade
A standard means of conducting subjective listening tests is to have a variety of listeners listen to various sounds and rate them on a scale of 1-5.


The subjective difference grade is determined by subtracting the grade of the reference signal from the grade of the coded signal. The perceptual coder was used to encode three different sounds: a flute, drums, and speech. The results of subjective tests for the three signals are shown below.
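
The subtraction can be written out as a short Python sketch. The grades below are made-up illustrative numbers, not the results of the listening tests reported here, and the function names are hypothetical.

```python
def subjective_difference_grade(coded_grade, reference_grade):
    """SDG: grade of the coded signal minus grade of the reference."""
    return coded_grade - reference_grade


def mean_sdg(ratings):
    """Average SDG over a list of (coded_grade, reference_grade) pairs."""
    sdgs = [subjective_difference_grade(c, r) for c, r in ratings]
    return sum(sdgs) / len(sdgs)


# Three hypothetical listeners, each grading on the 1-5 scale.
ratings = [(4.0, 5.0), (3.0, 5.0), (5.0, 5.0)]
print(mean_sdg(ratings))  # -1.0
```

An SDG of 0 means the coded signal was graded the same as the reference, and more negative values indicate more audible impairment.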


3.2 Comparison to LSE Coder
An alternative audio coder was designed to allocate bits based simply on energy, thus producing an audio signal with the Least Squared Error (LSE). The results of subjective testing of this LSE coder are compared to those of the perceptual coder in the following graph:


4. CONCLUSIONS
4.1 Current Coder
As a simple audio coder, this algorithm appears to work fairly well. However, as a perceptual audio coder, it does not perform well at all. It is greatly inferior to the LSE coder, suggesting that the perceptual model needs improvement. Two artifacts stand out. The first is pre-echo, which is not the most annoying artifact but is present due to the large frame length. The second is known as “birdies”, which adds a flying-saucer type of sound. Birdies are created when the masking thresholds change slightly over time, causing drastic changes in bit allocation. Both artifacts can be reduced by changes to the coder: pre-echo by allowing variable frame lengths, and possibly birdies by enhancing the perceptual model.

4.2 Future Improvements
Next semester, I plan to improve this coder by adding a number of advanced features and improving the existing code. The first change I plan to make concerns the perceptual model. The current model is based on a single function that varies only slightly with frequency. I plan to do some research to determine the best perceptual models to employ, and to implement masking functions that vary with frequency, loudness, and tonality. The next change I will make is to add variable frame-length (block-switching) capability. This will allow the coder to switch to a small block size when transients are present, and back to long frames when the sound is relatively constant. This will decrease pre-echo and speech reverberation while maintaining good frequency resolution when needed. I also plan to incorporate stereo coding into the next version of this coder. This will allow stereo files to be encoded more efficiently, while allowing me to learn about binaural masking.


APPENDIX
Matlab Source File

Encoded Sounds:
Original    128kbps                       64kbps                        32kbps
Flute       Perceptual, Spectral Power    Perceptual, Spectral Power    Perceptual, Spectral Power
Drums       Perceptual, Spectral Power    Perceptual, Spectral Power    Perceptual, Spectral Power
Speech      Perceptual, Spectral Power    Perceptual, Spectral Power    Perceptual, Spectral Power



RESOURCES

M. Bosi, R. Goldberg, Introduction to Digital Audio Coding and Standards, Kluwer Academic Publishers, 2003.
Audio Engineering Society CD-ROM, “Perceptual Audio Coders: What to Listen For”, AES 2001.






Jon Boley
http://www.jboley.com