
Speech Coding Techniques

4/7/2003

Introduction

Efficient speech-coding techniques

Advantages for VoIP


Digital streams of ones and zeros
The lower the bandwidth, the lower the quality

RTP payload types


Processing power

Better quality (for a given bandwidth) requires a more complex algorithm
A balance between quality and cost

Voice Quality

Bandwidth is easily quantified

Voice quality is subjective

MOS, Mean Opinion Score

ITU-T Recommendation P.800

Excellent 5
Good 4
Fair 3
Poor 2
Bad 1

A minimum of 30 people
Listeners rate voice samples or take part in conversations

P.800 recommendations

The selection of participants


The test environment
Explanations to listeners
Analysis of results

Toll quality

A MOS of 4.0 or higher
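The scoring procedure above can be sketched in a few lines of Python. This is only an illustration of averaging ratings on the 1 to 5 scale and comparing against the toll-quality threshold. The function names and the sample panel are hypothetical, and the sketch ignores the P.800 rules on participant selection and test environment.

```python
from statistics import mean

TOLL_QUALITY_MOS = 4.0   # threshold quoted on the slide
MIN_LISTENERS = 30       # minimum panel size quoted on the slide


def mean_opinion_score(ratings):
    """Average integer ratings on the 1 (Bad) .. 5 (Excellent) scale."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("need at least 30 ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


def is_toll_quality(ratings):
    """True when the averaged score reaches the toll-quality threshold."""
    return mean_opinion_score(ratings) >= TOLL_QUALITY_MOS


# Hypothetical panel: 12 listeners say Excellent, 15 Good, 3 Fair.
ratings = [5] * 12 + [4] * 15 + [3] * 3
```

With this made-up panel the average is 129 / 30 = 4.3, which would count as toll quality.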

About Speech

Speech

Model the vocal tract as a filter

Air is pushed from the lungs past the vocal cords and along the vocal tract
The basic vibrations occur at the vocal cords
The sound is altered by the disposition of the vocal tract (tongue and mouth)
The shape changes relatively slowly

The vibrations at the vocal cords

The excitation signal

Speech sounds

Voiced sounds

The vocal cords vibrate open and closed
Quasi-periodic pulses of air
The rate of opening and closing determines the pitch

Unvoiced sounds

Forcing air at high velocities through a constriction
Noise-like turbulence
Show little long-term periodicity
Short-term correlations still present

Plosive sounds

A complete closure in the vocal tract


Air pressure is built up and released suddenly

Voice Sampling

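Discrete-Time LTI Systems: The Convolution Sum

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$

[Figure: example sequences h[n] (n = 0 to 2), x[n] (n = 0 to 1) and the resulting output y[n] (n = 0 to 3)]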
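The convolution sum can be tried out directly in Python. The sketch below is a plain double loop over finite-length sequences. The sample values (h[n] = 1 for n = 0, 1, 2 and x[n] = 0.5, 2 for n = 0, 1) are an assumption read off the axis labels that survive from the figure, not values stated in the text.

```python
def convolve(x, h):
    """Finite-length convolution sum: y[n] = sum over k of x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            # The term x[k] * h[m] contributes to output index n = k + m.
            y[k + m] += xk * hm
    return y


# Assumed example values (taken from the figure residue).
x = [0.5, 2.0]
h = [1.0, 1.0, 1.0]
y = convolve(x, h)
```

Each input sample launches a scaled, shifted copy of h[n], and the output is the sum of those copies, which is why y[n] is longer than either input.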

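Nyquist sampling theorem

$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_S)$$

[Figure: spectrum X_c(jΩ) band-limited to ±Ω_N, and its replicas around multiples of Ω_S, with the band edge (Ω_S − Ω_N) marked]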
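A small numerical check can make the sampling theorem concrete: a tone above half the sampling rate produces exactly the same samples as a lower-frequency tone. The sketch below assumes an 8 kHz sampling rate (the telephony rate used later in the deck); the helper names are hypothetical.

```python
import math


def sample_cosine(freq_hz, fs_hz, count):
    """Return `count` samples of cos(2*pi*f*t) taken at rate fs_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(count)]


def alias_frequency(freq_hz, fs_hz):
    """Frequency in [0, fs/2] at which a real tone appears after sampling."""
    f = freq_hz % fs_hz
    return min(f, fs_hz - f)


FS = 8000  # assumed sampling rate in Hz
above_nyquist = sample_cosine(5000, FS, 16)   # 5 kHz > FS / 2
below_nyquist = sample_cosine(3000, FS, 16)   # 3 kHz < FS / 2
```

The 5 kHz tone is indistinguishable from a 3 kHz tone once sampled at 8 kHz, which is why the input must be band-limited to less than half the sampling rate before sampling.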

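Quantization (Scalar Quantization)

[Figure: the range from m_0 = -A to m_L = A is split at boundaries m_1, m_2, ..., m_k, m_{k+1}, ..., m_{L-1}; each interval J_k is represented by a value v_k (v_1, v_2, ..., v_{k+1}, ..., v_L)]

Assume $|x[n]| \le A$
Divide the range $[-A, A]$ into $L$ quantization levels $\{J_1, J_2, \ldots, J_k, \ldots, J_L\}$
$J_k : [m_{k-1}, m_k]$
$L = 2^R$
Each quantization level $J_k$ is represented by a value $v_k$
$S = \bigcup_k J_k$, $V = \{v_1, v_2, \ldots, v_k, \ldots, v_L\}$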
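As an illustration of the definitions above, here is a minimal uniform scalar quantizer. It assumes equal-width intervals and takes the midpoint of each interval as its representative value, which is one common choice rather than something the slide specifies. Note that the code numbers the intervals from 0, while the slide numbers them from 1.

```python
def uniform_quantize(x, A, R):
    """Quantize x in [-A, A] with L = 2**R equal-width intervals.

    Returns (k, v): the 0-based interval index and the representative
    value, taken here as the interval midpoint (an assumption).
    """
    L = 2 ** R
    step = 2 * A / L
    x = max(-A, min(A, x))                 # clip to the assumed range
    k = min(int((x + A) / step), L - 1)    # x == A falls in the last interval
    v = -A + (k + 0.5) * step
    return k, v


index, value = uniform_quantize(0.3, A=1.0, R=3)
```

With R = 3 bits there are 8 intervals of width 0.25, so the quantization error never exceeds half a step (0.125 here).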

Non-Uniform Quantization

[Figure: non-uniformly spaced boundaries from m_0 = -A through m_1, m_2, ... to m_L = A]

Concept: small quantization levels for small x, large quantization levels for large x
Goal: constant SNR_Q for all x

Companding

[Block diagram: x[n] -> compressor F(x) -> uniform quantization -> bit stream (e.g. 11011101) -> uniform decoder -> expandor F^{-1}(x) -> x̂[n]]

Compressor + Expandor = Compandor
F(x) specifies the non-uniform quantization characteristics

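Non-Uniform Quantization

μ-law

$$F(x) = \operatorname{sgn}(x)\,\frac{\log(1 + \mu |x|)}{\log(1 + \mu)}, \quad 0 \le |x| \le 1$$

A-law

$$F(x) = \begin{cases} \operatorname{sgn}(x)\,\dfrac{A|x|}{1 + \ln A}, & 0 \le |x| \le \dfrac{1}{A} \\[1ex] \operatorname{sgn}(x)\,\dfrac{1 + \ln(A|x|)}{1 + \ln A}, & \dfrac{1}{A} \le |x| \le 1 \end{cases}$$

Typical values in practice

μ = 255, A = 87.6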
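The two compression curves can be sketched in Python as continuous functions on inputs normalized to [-1, 1]. This is the textbook form of the characteristic, not the segmented 8-bit encoding tables used by real G.711 implementations, and the function names are hypothetical.

```python
import math

MU = 255.0      # typical mu-law parameter
A_LAW = 87.6    # typical A-law parameter


def mu_compress(x, mu=MU):
    """mu-law compressor F(x) for -1 <= x <= 1."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress (the expandor)."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def a_compress(x, a=A_LAW):
    """A-law compressor F(x) for -1 <= x <= 1."""
    ax = abs(x)
    if ax < 1.0 / a:
        y = a * ax / (1.0 + math.log(a))              # linear segment near zero
    else:
        y = (1.0 + math.log(a * ax)) / (1.0 + math.log(a))  # logarithmic segment
    return math.copysign(y, x)
```

Small inputs are stretched (for example, 0.01 maps to roughly 0.23 under μ-law), so a uniform quantizer applied after the compressor spends more of its levels on quiet signals.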

Types of Speech Codecs

Waveform codecs, source codecs (also known as vocoders), and hybrid codecs.

Speech Source Model and Source Coding

[Block diagram: Excitation: a random sequence generator (unvoiced) or a periodic pulse train generator with period N (voiced), selected by the v/u switch and scaled by the gain G, produces u[n]. Vocal Tract Model: u[n] drives the filter G(z), G(ω), g[n], which outputs x[n]]

Excitation parameters

v/u: voiced/unvoiced
N: pitch for voiced
G: signal gain
These define the excitation signal u[n]

Vocal Tract parameters

{a_k}: LPC coefficients, capturing the formant structure of speech signals

$$G(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$

A good approximation, though not precise enough

LPC Vocoder (Voice Coder)

[Block diagram: x[n] -> LPC analysis -> {a_k}, N, G, v/u -> encoder -> bit stream (e.g. 11011) -> receiver: decoder -> {a_k}, N, G, v/u -> excitation (Ex) -> g[n], G(z) -> x[n]]

N by pitch detection
v/u by voicing detection
{a_k} can be non-uniformly or vector quantized to reduce the bit rate
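The decoder side of the vocoder can be sketched as two steps: build the excitation u[n] from v/u, N and G, then run it through the all-pole filter defined by {a_k}. The code below is a minimal illustration under simplifying assumptions (unit pulses for voiced frames, Gaussian noise for unvoiced frames, zero initial filter state); the function names are hypothetical.

```python
import random


def make_excitation(voiced, pitch_period, gain, length, seed=0):
    """Pulse train every `pitch_period` samples (voiced) or scaled noise (unvoiced)."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(length)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(length)]


def lpc_synthesize(u, a):
    """All-pole synthesis: x[n] = u[n] + sum_{k=1..P} a[k-1] * x[n-k]."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:          # samples before n = 0 are taken as zero
                acc += ak * x[n - k]
        x.append(acc)
    return x


# Hypothetical frame: voiced, pitch period of 4 samples, gain 2, one coefficient.
u = make_excitation(voiced=True, pitch_period=4, gain=2.0, length=8)
x = lpc_synthesize(u, a=[0.5])
```

Only the handful of parameters per frame needs to be transmitted, which is why the bit rate is so much lower than that of a waveform codec.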

G.711

The most commonplace codec

Used in the circuit-switched telephone network
PCM, Pulse-Code Modulation
If uniform quantization
12 bits * 8 k/sec = 96 kbps
Non-uniform quantization
8 bits * 8 k/sec = 64 kbps, the DS0 rate
μ-law

North America
A-law

Other countries, a little friendlier to lower signal levels
An MOS of about 4.3
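The bit-rate arithmetic for PCM is simply bits per sample times samples per second. A minimal sketch, assuming the 8 kHz telephony sampling rate and the 8-bit companded samples that G.711 uses:

```python
SAMPLE_RATE_HZ = 8000  # telephony sampling rate


def pcm_bit_rate_kbps(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate of a PCM stream in kbps."""
    return bits_per_sample * sample_rate_hz / 1000


uniform_rate = pcm_bit_rate_kbps(12)    # uniform quantization
companded_rate = pcm_bit_rate_kbps(8)   # non-uniform (companded) quantization
```

Companding saves a third of the bandwidth compared with 12-bit uniform quantization.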

ADPCM (Adaptive Differential PCM)

DPCM and ADPCM

ADPCM: Adaptive Prediction in DPCM
Adaptive Quantization

Quantization level varies with local signal level
$\Delta[n] = a\,\sigma_x[n]$
$\sigma_x[n]$: locally estimated standard deviation of x[n]

G.721: ADPCM-coded speech at 32 kbps

G.726 (A-law or μ-law)

16, 24, 32, 40 kbps
MOS 4.0 at 32 kbps
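The idea of a step size that follows the local signal level can be sketched as follows. The local standard deviation is estimated here with an exponentially weighted average of past squared samples, which is an assumption made for illustration; it is not the adaptation rule of G.721 or G.726, and the parameter values are hypothetical.

```python
import math


def adaptive_steps(x, a=0.5, alpha=0.9, floor=1e-3):
    """Step size for each sample: a * (estimated local standard deviation).

    The estimate for sample n uses only samples before n, so a decoder
    tracking the same history could reproduce it.
    """
    power = 0.0
    steps = []
    for sample in x:
        steps.append(max(a * math.sqrt(power), floor))
        # Update the running power estimate with the current sample.
        power = alpha * power + (1.0 - alpha) * sample * sample
    return steps


# Hypothetical signal: a quiet passage followed by a loud one.
signal = [0.01] * 50 + [1.0] * 50
steps = adaptive_steps(signal)
```

The step stays small during the quiet passage and grows by roughly two orders of magnitude once the loud passage starts, so the same number of bits covers both.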

Analysis-by-Synthesis (AbS) Codecs

Hybrid codec

Fill the gap between waveform and source codecs
The most successful and commonly used

Time-domain AbS codecs

Not a simple two-state, voiced/unvoiced

Different excitation signals are attempted

Closest to the original waveform is selected

MPE, Multi-Pulse Excited

RPE, Regular-Pulse Excited

CELP, Code-Excited Linear Predictive

G.728 LD-CELP

CELP codecs

A filter; its characteristics change over time

A codebook of acoustic vectors
A vector = a set of elements representing various characteristics of the excitation
Transmit: filter coefficients, gain, and a pointer to the chosen vector
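The analysis-by-synthesis search can be sketched as a loop that synthesizes speech from every codebook vector and keeps the one closest to the target. The sketch below uses a plain squared error and a toy three-entry codebook; real CELP coders use a perceptually weighted error and much larger codebooks, and all names and values here are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole filter: out[n] = e[n] + sum_{k=1..P} a[k-1] * out[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * out[n - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the vector that best matches `target`."""
    best = None
    for index, vector in enumerate(codebook):
        y = synthesize(vector, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this candidate.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
coeffs = [0.5]
target = [0.0, 2.0, 1.0, 0.5]   # equals 2 * synthesize(codebook[1], coeffs)
best_index, best_gain, best_error = search_codebook(target, codebook, coeffs)
```

The encoder then transmits only the winning index, the gain and the filter coefficients; the decoder repeats the synthesis step with the same codebook.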

Low Delay CELP

Backward-adaptive coder
Use previous samples to determine filter coefficients
Operates on five samples at a time
Delay < 1 ms
Only the pointer is transmitted

1024 vectors in the code book


10-bit pointer (index)
16 kbps
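The LD-CELP figures follow from simple arithmetic, which can be checked with a short sketch. It assumes an 8 kHz sampling rate (not stated on this slide, but the rate used elsewhere in the deck); the function name is hypothetical.

```python
def ld_celp_numbers(sample_rate_hz=8000, samples_per_vector=5, codebook_size=1024):
    """Return (index bits, bit rate in kbps, buffering delay in ms)."""
    index_bits = (codebook_size - 1).bit_length()        # bits needed to address the codebook
    vectors_per_second = sample_rate_hz / samples_per_vector
    bit_rate_kbps = index_bits * vectors_per_second / 1000
    buffering_delay_ms = 1000 * samples_per_vector / sample_rate_hz
    return index_bits, bit_rate_kbps, buffering_delay_ms


index_bits, bit_rate_kbps, buffering_delay_ms = ld_celp_numbers()
```

A 10-bit index sent 1600 times per second gives 16 kbps, and buffering five samples takes 0.625 ms, consistent with the sub-millisecond delay quoted above.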

LD-CELP encoder

Minimize a frequency-weighted mean-square error

LD-CELP decoder

An MOS of about 3.9
One-quarter of the G.711 bandwidth

G.723.1 ACELP

6.3 or 5.3 kbps

Both mandatory
Can change from one to another during a conversation

The coder

A band-limited input speech signal

Sampled at 8 kHz, 16-bit uniform PCM quantization
Operates on blocks of 240 samples (30 ms) at a time
A look-ahead of 7.5 ms
A total algorithmic delay of 37.5 ms + other delays
A high-pass filter to remove any DC component

G.723.1 Annex A

Silence Insertion Descriptor (SID) frames of size four octets

The two LSBs of the first octet indicate the frame type:

Bits  Frame type  Octets/frame
00    6.3 kbps    24
01    5.3 kbps    20
10    SID frame   4

An MOS of about 3.8

At least 37.5 ms delay
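A receiver can use the two least significant bits of the first octet to find frame boundaries in a stream of concatenated frames. The sketch below covers only the three codes listed above and treats any other code as an error; the function names are hypothetical and the payload bytes are dummy values.

```python
# Frame type and size in octets, keyed by the two LSBs of the first octet.
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}


def frame_info(first_octet):
    """Return (description, size in octets) for a frame's first octet."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError("frame type code not covered by this sketch")
    return FRAME_TYPES[code]


def split_frames(payload):
    """Split concatenated frames using the size implied by each first octet."""
    frames = []
    pos = 0
    while pos < len(payload):
        _, size = frame_info(payload[pos])
        if pos + size > len(payload):
            raise ValueError("truncated frame")
        frames.append(payload[pos:pos + size])
        pos += size
    return frames


# Dummy payload: one 6.3 kbps frame, one 5.3 kbps frame, one SID frame.
payload = bytes([0x00] + [0] * 23) + bytes([0x01] + [0] * 19) + bytes([0x02, 0, 0, 0])
frames = split_frames(payload)
```

Because the frame size is self-describing, the coder can switch between the two rates, or to SID frames, from one frame to the next without extra signalling.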

G.729

8 kbps
Input frames of 10 ms, 80 samples at an 8 kHz sampling rate
5 ms look-ahead

Algorithmic delay of 15 ms

An 80-bit frame for 10 ms of speech
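The frame duration, algorithmic delay and bit rate quoted above are linked by simple arithmetic, shown here as a small sketch. The helper name is hypothetical; the inputs are the figures from the slide.

```python
def frame_numbers(frame_samples, lookahead_ms, frame_bits, sample_rate_hz=8000):
    """Return (frame duration in ms, algorithmic delay in ms, bit rate in kbps)."""
    frame_ms = 1000 * frame_samples / sample_rate_hz
    algorithmic_delay_ms = frame_ms + lookahead_ms   # one full frame plus look-ahead
    bit_rate_kbps = frame_bits / frame_ms            # bits per millisecond equals kbps
    return frame_ms, algorithmic_delay_ms, bit_rate_kbps


g729 = frame_numbers(frame_samples=80, lookahead_ms=5, frame_bits=80)
```

For G.729 this gives a 10 ms frame, 15 ms of algorithmic delay and 8 kbps.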


A complex codec

G.729A (Annex A), a number of simplifications
Same frame structure
Encoder/decoder can be G.729 or G.729A (the two interoperate)
Slightly lower quality

G.729.B

VAD, Voice Activity Detection

DTX, Discontinuous Transmission

Based on analysis of several parameters of the input
The current frame plus the two preceding frames
Send nothing or send an SID frame
An SID frame contains information to generate comfort noise

CNG, Comfort Noise Generation

G.729, an MOS of about 4.0


G.729A an MOS of about 3.7
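The interplay of VAD and DTX can be illustrated with a deliberately simplified sketch. It uses frame energy as the only parameter and declares silence when the current frame and the two preceding frames are all below a threshold. The real G.729 Annex B detector analyses several parameters, so the threshold, the energy measure and the rule of sending a single SID frame per silent period are all assumptions made for illustration.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def dtx_decisions(frames, threshold=1e-4):
    """Return 'speech', 'sid' or 'nothing' for each frame."""
    decisions = []
    history = []          # activity flags for the current and two preceding frames
    sid_sent = False
    for frame in frames:
        history = (history + [frame_energy(frame) > threshold])[-3:]
        if any(history):
            decisions.append("speech")
            sid_sent = False
        elif not sid_sent:
            decisions.append("sid")       # describe the background noise once
            sid_sent = True
        else:
            decisions.append("nothing")   # receiver keeps generating comfort noise
    return decisions


loud = [0.1] * 80     # hypothetical active frame (80 samples = 10 ms)
quiet = [0.0] * 80    # hypothetical silent frame
decisions = dtx_decisions([loud, quiet, quiet, quiet, quiet, loud])
```

Looking back over the preceding frames keeps the coder from cutting off the tail of a word, at the cost of sending a couple of extra speech frames after each talk spurt.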

Other Codecs

CDMA QCELP defined in IS-733

Variable-rate coder
Two most common rates

The high rate, 13.3 kbps


A lower rate, 6.2 kbps

Silence suppression
For use with RTP, RFC 2658

GSM Enhanced Full-Rate (EFR)

GSM 06.60
An enhanced version of GSM Full-Rate
ACELP-based codec
The same bit rate and the same
overall packing structure

12.2 kbps

Support discontinuous transmission


For use with RTP, RFC 1890

GSM Adaptive Multi-Rate (AMR) codec

GSM 06.90
Eight different modes
4.75 kbps to 12.2 kbps
12.2 kbps, GSM EFR
7.4 kbps, IS-641 (TDMA cellular systems)
Change the mode at any time
Offer discontinuous transmission

The MOS values are for laboratory conditions

G.711 does not deal with lost packets

G.729 can accommodate a lost frame by interpolating from previous frames

But this causes errors in subsequent speech frames

Processing Power

G.728 or G.729, 40 MIPS


G.726 10 MIPS

Cascaded Codecs

E.g., G.711 stream -> G.729 encoder/decoder
The resulting quality might not even come close to that of G.729 alone

Each coder generates only an approximation of the incoming signal

Tones, Signals, and DTMF Digits

The hybrid codecs are optimized for human speech

Other data may need to be transmitted


Tones: fax tones, dialing tone, busy tone
DTMF digits for two-stage dialing or voicemail

G.711 is OK
G.723.1 and G.729 can be unintelligible
The ingress gateway needs to intercept the tones and DTMF digits

Easy at the start of a call
Difficult in the middle of a call

Encode the tones differently from the speech

Send them along the same media path

An RTP packet provides the name of the tone and the duration
Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration
RFC 2198
An RTP payload format for redundant audio data
Sending both types of RTP payload

RTP Payload Format for DTMF Digits

An Internet Draft
Both methods described before
A large number of tones and events

DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.

The named events

E: the end of the tone, R: reserved

Payload format
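A named-event payload can be packed and unpacked with Python's `struct` module. The slide names only the E and R fields, so the sketch assumes the four-octet layout that was later published as RFC 2833: an 8-bit event code, the E bit, the R bit, a 6-bit volume and a 16-bit duration in timestamp units. The function names are hypothetical.

```python
import struct

# Event codes for DTMF: digits 0-9, then * = 10, # = 11, A-D = 12-15.
DTMF_EVENTS = {ch: i for i, ch in enumerate("0123456789*#ABCD")}


def pack_dtmf_event(digit, end, volume, duration):
    """Build the 4-octet named-event payload (assumed RFC 2833 layout)."""
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0x00) | volume   # E bit is 0x80; R bit (0x40) stays 0
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags, duration)


def unpack_dtmf_event(payload):
    """Return (event code, end flag, volume, duration)."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration


# Digit 5, final packet of the tone, volume 10, 800 timestamp units (100 ms at 8 kHz).
packet = pack_dtmf_event("5", end=True, volume=10, duration=800)
```

Sending the digit by name means the far-end gateway can regenerate a clean tone instead of relying on a low-bit-rate codec to carry it.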

Finis


Speech Source Model and Source Coding

Vocal Tract Model

$$x[n] = u[n] + \sum_{k=1}^{P} a_k\,x[n-k]$$

$$G(z) = \frac{X(z)}{U(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$
