MPEG Audio Compression Fundamentals

Fundamentals of Multimedia, Chapter 14
Chapter 14
MPEG Audio Compression
14.1 Psychoacoustics
14.2 MPEG Audio
14.3 Other Commercial Audio Codecs
14.4 The Future: MPEG-7 and MPEG-21
14.5 Further Exploration
© Prentice Hall 2003

Li & Drew
14.1 Psychoacoustics
The range of human hearing is about 20 Hz to about 20 kHz.
The frequency range of the voice is typically only from about 500 Hz to 4 kHz.
The dynamic range, the ratio of the maximum sound amplitude to the quietest sound that humans can hear, is on the order of about 120 dB.
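To get a feel for what 120 dB means, the sketch below converts a level difference in dB to an amplitude ratio. It assumes the usual amplitude convention, dB = 20 log10(ratio); the function names are illustrative and not taken from the text.

```python
import math

def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to an amplitude ratio (20*log10 convention, assumed)."""
    return 10.0 ** (db / 20.0)

def amplitude_ratio_to_db(ratio):
    """Inverse conversion: amplitude ratio back to dB."""
    return 20.0 * math.log10(ratio)

# A 120 dB dynamic range corresponds to an amplitude ratio of about a million to one.
print(db_to_amplitude_ratio(120))
```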

Frequency Masking
Lossy audio data compression methods, such as MPEG/Audio encoding, remove some sounds that are masked anyway.
The general situation in regard to masking is as follows:

1. A lower tone can effectively mask (make us unable to hear) a higher tone.
2. The reverse is not true: a higher tone does not mask a lower tone well.
3. The greater the power in the masking tone, the wider its influence: the broader the range of frequencies it can mask.
4. As a consequence, if two tones are widely separated in frequency, then little masking occurs.

Threshold of Hearing
A plot of the threshold of human hearing for a pure tone:

[Figure: threshold level in dB (vertical axis) against frequency in Hz (logarithmic horizontal axis, 10^2 to 10^4)]
Fig. 14.2: Threshold of human hearing, for pure tones

Threshold of Hearing (cont'd)
The threshold of hearing curve: if a sound is above the dB level shown, then the sound is audible.
Turning up a tone so that it equals or surpasses the curve means that we can then distinguish the sound.
An approximate formula exists for this curve, with f in Hz:

$$\mathrm{Threshold}(f) = 3.64\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^{2}} + 10^{-3}\,(f/1000)^{4} \qquad (14.1)$$

The threshold units are dB; the frequency for the origin (0,0) in formula (14.1) is 2,000 Hz: Threshold(f) ≈ 0 at f = 2 kHz.
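Formula (14.1) is easy to evaluate numerically. The following is a minimal sketch; the function name is an illustrative choice, and the sample frequencies are arbitrary.

```python
import math

def hearing_threshold_db(f_hz):
    """Approximate threshold of hearing in dB for a pure tone at f_hz, per formula (14.1)."""
    k = f_hz / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)

# Print the threshold at a few frequencies; the value near 2 kHz is close to 0 dB.
for f in (100, 1000, 2000, 4000, 10000):
    print(f, round(hearing_threshold_db(f), 2))
```

The curve is high at low frequencies, dips in the few-kHz region where hearing is most sensitive, and rises again toward the top of the audible range.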

Frequency Masking Curves
Frequency masking is studied by playing a particular pure tone, say 1 kHz again, at a loud volume, and determining how this tone affects our ability to hear tones nearby in frequency.
One would generate a 1 kHz masking tone at a fixed sound level of 60 dB, and then raise the level of a nearby tone, e.g., 1.1 kHz, until it is just audible.
The threshold in Fig. 14.3 plots the audible level for a single masking tone (1 kHz).
Fig. 14.4 shows how the plot changes if other masking tones are used.

[Figure: audible threshold in dB against frequency (0 to 15 kHz) for a 1 kHz masking tone; plot labels: "Audible tone", "Inaudible tone"]
Fig. 14.3: Effect on threshold for 1 kHz masking tone

[Figure: threshold in dB against frequency (0 to 15 kHz), with curves labelled 1, 4, and 8]
Fig. 14.4: Effect of masking tone at three different frequencies

Temporal Masking
Phenomenon: any loud tone will cause the hearing receptors in the inner ear to become saturated and require time to recover.
The following figures show the results of masking experiments:

[Figure: level in dB against delay time in ms (logarithmic axis), showing the mask tone and the test tone]
Fig. 14.6: The louder the test tone, the less time it takes for our hearing to recover from the masking.

[Figure: surface plot of level in dB against time and frequency; tones below the surface are inaudible]
Fig. 14.7: Effect of temporal and frequency masking, depending on both time and closeness in frequency.
14.2 MPEG Audio
MPEG audio compression takes advantage of psychoacoustic models, constructing a large multi-dimensional lookup table to transmit masked frequency components using fewer bits.
MPEG Audio Overview

1. Applies a filter bank to the input to break it into its frequency components.
2. In parallel, applies a psychoacoustic model to the data; its output drives the bit allocation block.
3. Uses the number of bits allocated to quantize the information from the filter bank, providing the compression.
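Step 3 can be illustrated with a toy uniform quantizer: the fewer bits a subband is allocated, the coarser its reconstruction. This is only a sketch under simplifying assumptions (samples normalized to [-1, 1], a plain rounding quantizer, hypothetical function names); the quantizers and scale factors actually used by MPEG are more elaborate.

```python
def quantize(sample, bits):
    """Round a sample to the nearest multiple of the step size implied by the bit budget."""
    step = 2.0 / (2 ** bits)  # the range [-1, 1] divided into 2**bits steps (assumption)
    return round(sample / step) * step

def max_error(samples, bits):
    """Largest absolute reconstruction error over the given samples."""
    return max(abs(s - quantize(s, bits)) for s in samples)

# A test ramp from -1 to 1: the error shrinks as more bits are allocated.
samples = [i / 50.0 - 1.0 for i in range(101)]
for bits in (2, 4, 8):
    print(bits, max_error(samples, bits))
```

The worst-case error is bounded by half a step, so each extra bit roughly halves it. Bit allocation spends these bits where the psychoacoustic model says quantization noise would otherwise be audible.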

MPEG Layers
MPEG audio offers three compatible layers:
  Each succeeding layer is able to understand the lower layers.
  Each succeeding layer offers more complexity in the psychoacoustic model and better compression for a given level of audio quality.
  Each succeeding layer, with increased compression effectiveness, is accompanied by extra delay.
The objective of MPEG layers: a good tradeoff between quality and bit-rate.

MPEG Layers (cont'd)
Layer 1 quality can be quite good provided a comparatively high bit-rate is available.
  Digital Audio Tape typically uses Layer 1 at around 192 kbps.
Layer 2 has more complexity; it was proposed for use in Digital Audio Broadcasting.
Layer 3 (MP3) is the most complex, and was originally aimed at audio transmission over ISDN lines.
Most of the complexity increase is at the encoder, not the decoder, accounting for the popularity of MP3 players.

MPEG Audio Strategy
The MPEG approach to compression relies on:
  Quantization
  The human auditory system is not accurate within the width of a critical band (in perceived loudness and audibility of a frequency)
The MPEG encoder employs a bank of filters to:
  Analyze the frequency (spectral) components of the audio signal by calculating a frequency transform of a window of signal values
  Decompose the signal into subbands by using a bank of filters (Layers 1 and 2: quadrature-mirror; Layer 3: adds a DCT; psychoacoustic model: Fourier transform)

MPEG Audio Strategy (cont'd)
Frequency masking: a psychoacoustic model is used to estimate the just-noticeable noise level:
  The encoder balances the masking behavior and the available number of bits by discarding inaudible frequencies
  Quantization is scaled according to the sound level that is left over, above masking levels
May take into account the actual width of the critical bands:
  For practical purposes, audible frequencies are divided into 25 main critical bands (Table 14.1)
  To keep things simple, MPEG adopts a uniform width for all frequency analysis filters, using 32 overlapping subbands
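To make the uniform split concrete, the sketch below computes nominal edges for 32 equal-width subbands. It assumes a 44.1 kHz sampling rate (chosen only for illustration; the text does not fix a rate) and ignores the overlap between neighbouring filters. The function name is hypothetical.

```python
NUM_SUBBANDS = 32

def subband_edges(sample_rate_hz, num_subbands=NUM_SUBBANDS):
    """Return nominal (low, high) edges in Hz of equal-width subbands up to the Nyquist frequency."""
    width = (sample_rate_hz / 2.0) / num_subbands
    return [(k * width, (k + 1) * width) for k in range(num_subbands)]

edges = subband_edges(44100)  # 44.1 kHz is an assumed rate
print(edges[0], edges[-1])
```

At this assumed rate each subband is about 689 Hz wide, the same width at every frequency, unlike the critical bands themselves.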

MPEG Audio Compression Algorithm

[Figure: encoder: audio (PCM) input -> time-to-frequency transformation -> bit allocation, quantizing and coding -> bitstream formatting -> encoded bitstream, with psychoacoustic modeling ("what to drop") alongside. Decoder: encoded bitstream -> bitstream unpacking -> frequency sample reconstruction -> frequency-to-time transformation -> decoded PCM audio]
Fig. 14.9: Basic MPEG Audio encoder and decoder.

Basic Algorithm (cont'd)
The algorithm proceeds by dividing the input into 32 frequency subbands, via a filter bank.
  A linear operation taking 32 PCM samples, sampled in time; output is 32 frequency coefficients.
In the Layer 1 encoder, the sets of 32 PCM values are first assembled into a set of 12 groups of 32s.
  This gives an inherent time lag in the coder, equal to the time to accumulate 384 (i.e., 12 × 32) samples.
Fig. 14.11 shows how samples are organized.
  A Layer 2 or Layer 3 frame actually accumulates more than 12 samples for each subband: a frame includes 1,152 samples.
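The frame sizes above follow from simple arithmetic, and the accumulation lag depends on the sampling rate. The sketch below works this out; the 48 kHz and 44.1 kHz rates are assumptions used for illustration, and the function names are hypothetical.

```python
SUBBANDS = 32
SAMPLES_PER_GROUP = 12

def frame_samples(groups_per_subband):
    """PCM samples per frame: groups of 12 samples in each of the 32 subbands."""
    return groups_per_subband * SAMPLES_PER_GROUP * SUBBANDS

def frame_duration_ms(num_samples, sample_rate_hz):
    """Time needed to accumulate num_samples at the given sampling rate."""
    return 1000.0 * num_samples / sample_rate_hz

layer1 = frame_samples(1)    # one group of 12 per subband
layer2_3 = frame_samples(3)  # three groups of 12 per subband

for rate in (48000, 44100):  # assumed sampling rates
    print(rate, frame_duration_ms(layer1, rate), frame_duration_ms(layer2_3, rate))
```

This is only the lag from collecting samples; a real coder adds further delay from filtering and processing.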

[Figure: audio (PCM) samples enter subband filters 0 to 31; each subband filter produces 1 sample out for every 32 samples in. A Layer 1 frame holds one group of 12 samples from each subband; a Layer 2 or Layer 3 frame holds three groups of 12 samples from each subband]
Fig. 14.11: MPEG Audio Frame Sizes

Mask calculations are performed in parallel with subband filtering, as in Fig. 14.13:

[Figure: main path: PCM audio signal -> filter bank (32 subbands) -> linear quantizer -> bitstream formatting -> coded audio signal; parallel path: 1,024-point FFT, psychoacoustic model, side-information coding]
Fig. 14.13: MPEG-1 Audio Layers 1 and 2.

Layer 2 of MPEG-1 Audio
Main difference:
  Three groups of 12 samples are encoded in each frame, and temporal masking is brought into play as well as frequency masking.
  Bit allocation is applied to window lengths of 36 samples instead of 12.
  The resolution of the quantizers is increased from 15 bits to 16.
Advantage:
  A single scaling factor can be used for all three groups.

Layer 3 of MPEG-1 Audio
Main difference:
  Employs a similar filter bank to that used in Layer 2, except using a set of filters with non-equal frequencies.
  Takes into account stereo redundancy.
  Uses the Modified Discrete Cosine Transform (MDCT), which addresses problems that the DCT has at the boundaries of the window used, by overlapping frames by 50%:

$$F(u) = 2 \sum_{i=0}^{N-1} f(i) \cos\left[ \frac{2\pi}{N} \left( i + \frac{N/2 + 1}{2} \right) (u + 1/2) \right], \quad u = 0, \ldots, N/2 - 1 \qquad (14.7)$$
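Formula (14.7) can be evaluated directly. The following is a minimal, unoptimized sketch: it applies no window function, the helper for 50% overlap is a hypothetical illustration, and a real Layer 3 encoder uses specific window shapes and block lengths that are not covered here.

```python
import math

def mdct(f):
    """Evaluate formula (14.7) on a window f of even length N; returns N/2 coefficients."""
    n = len(f)
    if n % 2:
        raise ValueError("window length must be even")
    shift = (n / 2 + 1) / 2  # the (N/2 + 1)/2 term in (14.7)
    return [
        2.0 * sum(f[i] * math.cos(2 * math.pi / n * (i + shift) * (u + 0.5))
                  for i in range(n))
        for u in range(n // 2)
    ]

def overlapping_windows(samples, n):
    """Split samples into length-n windows that overlap by 50% (hop of n/2)."""
    hop = n // 2
    return [samples[s:s + n] for s in range(0, len(samples) - n + 1, hop)]

signal = [math.sin(0.3 * t) for t in range(16)]
for window in overlapping_windows(signal, 8):
    print([round(c, 3) for c in mdct(window)])
```

Each window of N samples yields only N/2 coefficients, so with a hop of N/2 the overlapped transform produces as many coefficients as there are input samples.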

[Figure: main path: PCM audio signal -> filter bank (32 subbands) -> M-DCT -> nonuniform quantization -> Huffman coding -> bitstream formatting -> coded audio signal; parallel path: 1,024-point FFT, psychoacoustic model, side-information coding]
Fig. 14.14: MPEG-Audio Layer 3 Coding.

Table 14.2 shows various achievable MP3 compression ratios:

Table 14.2: MP3 compression performance

Sound Quality            Bandwidth   Mode     Compression Ratio
Telephony                3.0 kHz     Mono     96:1
Better than short-wave   4.5 kHz     Mono     48:1
Better than AM radio     7.5 kHz     Mono     24:1
Similar to FM radio      11 kHz      Stereo   26:1 to 24:1
Near-CD                  15 kHz      Stereo   16:1
CD                       > 15 kHz    Stereo   14:1 to 12:1
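A compression ratio only becomes a bit-rate once the uncompressed source format is fixed. The sketch below assumes 16-bit PCM at 48 kHz per channel as the reference; the table does not state its baseline, so this is an assumption, and the function name is hypothetical.

```python
def mp3_bitrate_kbps(compression_ratio, channels, sample_rate_hz=48000, bits_per_sample=16):
    """Bit-rate implied by a compression ratio, relative to an assumed uncompressed PCM source."""
    pcm_kbps = sample_rate_hz * bits_per_sample * channels / 1000.0
    return pcm_kbps / compression_ratio

print(mp3_bitrate_kbps(96, channels=1))  # telephony row, mono
print(mp3_bitrate_kbps(12, channels=2))  # CD row, stereo
```

With a different reference format, such as 44.1 kHz source audio, the same ratios would give proportionally lower bit-rates.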

14.5 Further Exploration

Link to Further Exploration for Chapter 14.
In the Chapter 14 Further Exploration section of the text website, a number of useful links are given:
  Excellent collections of MPEG Audio and MP3 links.
  The official MPEG Audio FAQ.
  MPEG-4 Audio implements "Tools for Large Step Scalability". An excellent reference is given by the Fraunhofer-Gesellschaft research institute, "MPEG 4 Audio Scalable Profile".
