
DSP
by
Andreas Spanias, Ph.D.

Speech Processing

spanias@asu.edu

Phone: 480 965 1837, Fax: 480 965 8325


http://www.eas.asu.edu/~spanias

Copyright (c) Andreas Spanias 10-1

Topics

1. Speech Spectrum and Source System Coders

2. Speech Processing Analysis-Synthesis Algorithms

3. Historical Perspective on Algorithmic Research

4. The Standards on Speech Coding

5. Algorithm Examples

6. Remarks



Voiced and Unvoiced Speech

[Figure: two time-domain speech segments (amplitude from -1.0 to 1.0 versus time, 0 to 32 ms), each shown beside its short-time spectrum (magnitude in dB versus frequency, 0 to 4 kHz). The upper spectrum is annotated with the fundamental frequency and the formant structure.]


Fine (Pitch) and Formant Structure of the Short-time Speech Spectrum

Fine Harmonic Structure: reflects the quasi-periodicity of speech and is attributed to the vibrating vocal cords. Note the narrow peaks.

Formant Structure (Spectral Envelope): is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity. Note the envelope peaks.



Formants
Formants: peaks of the spectral envelope representing the resonant
modes of the vocal tract. There are 3-5 formants below 5 kHz. The first 3 formants,
usually occurring below 3 kHz, are quite important both in speech synthesis
and perception. Higher formants are important for wideband and unvoiced
speech representations.

[Figure: spectral envelope with the formant peaks labeled f1 to f5.]


Speech Analysis/Synthesis
Speech analysis-synthesis: speech is analyzed (represented)
in terms of a compact parametric set, which is then used for
speech synthesis. Speech coding at medium rates and below
is achieved using an analysis-synthesis process.

Closed-loop analysis or analysis-by-synthesis: the parameters are extracted and encoded by explicitly minimizing the difference between the original and the reconstructed speech. CELP-type algorithms belong to this category. Closed-loop analysis usually has high complexity.

Open-loop analysis: the parameters are extracted and encoded without considering the difference between the original and the reconstructed speech.
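
Speech Synthesis Model (1)

[Figure: source-filter model. In the frequency domain (f), the excitation spectrum X(z) multiplied by the filter response 1/A(z) gives the speech spectrum S(z); in the time domain (t), the excitation convolved with the filter impulse response gives the speech signal.]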

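Simple Speech Synthesis Model (2)

Requires “hard” (binary) voicing (V/UV) and pitch information.

[Figure: an excitation selected by the V/UV switch (driven by the pitch info when voiced) is scaled by a gain and passed through the vocal tract filter to produce synthetic speech.]

$$H(z) = \frac{b_0}{1 - \sum_{i=1}^{M} a_i z^{-i}}$$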
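The two-state model above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes the denominator convention 1 - sum_i a_i z^-i for the vocal tract filter, a unit impulse train for voiced excitation, and white Gaussian noise for unvoiced excitation. The function names and the toy coefficients are hypothetical.

```python
import random

def all_pole_filter(b0, a, x):
    """y[n] = b0*x[n] + sum_i a[i-1]*y[n-i], i.e. H(z) = b0 / (1 - sum_i a_i z^-i).

    The sign convention of the denominator is an assumption of this sketch.
    """
    y = []
    for n, x_n in enumerate(x):
        acc = b0 * x_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * y[n - i]
        y.append(acc)
    return y

def synthesize_frame(a, b0, voiced, pitch_period, n_samples, seed=0):
    """Drive the vocal tract filter with a hard V/UV excitation decision."""
    if voiced:
        # Voiced: periodic impulse train, one pulse every pitch_period samples.
        excitation = [1.0 if n % pitch_period == 0 else 0.0
                      for n in range(n_samples)]
    else:
        # Unvoiced: white Gaussian noise (seeded so the sketch is repeatable).
        rng = random.Random(seed)
        excitation = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return all_pole_filter(b0, a, excitation)

# Toy first-order filter (M = 1), pitch period of 4 samples.
frame = synthesize_frame(a=[0.5], b0=1.0, voiced=True, pitch_period=4, n_samples=8)
```

In a real coder the coefficients, gain, pitch, and voicing decision would be updated every frame from the analysis stage; here they are fixed for a single short frame.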



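Speech Analysis-by-Synthesis (closed-loop)

Synthesized speech is forced to match the input speech.

[Figure: closed-loop analysis-by-synthesis. An excitation is selected or formed, scaled by a gain, and passed through the LTP synthesis filter (A_L(z)) and the LP synthesis filter (A(z)) to give the synthetic speech ŝ(n). The error s(n) - ŝ(n) is perceptually weighted by W(z), and the excitation is chosen to minimize its MSE. Insets show the frequency responses of the two synthesis filters.]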


LTP excited by a random signal creates pseudo-periodicity

$$\frac{1}{1 - 0.95\, z^{-30}}$$

[Figure: impulse response and frequency response of this LTP filter; magnitude response (dB) versus normalized frequency (Nyquist = 1).]
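A short Python sketch can make the pseudo-periodicity concrete. It assumes the LTP synthesis filter is 1/(1 - 0.95 z^-30), that is, the recursion y[n] = x[n] + 0.95 y[n-30]; the function names are hypothetical. The impulse response repeats every 30 samples with decaying amplitude, and the magnitude response peaks at multiples of 1/30 of the sampling frequency.

```python
import cmath
import math

def ltp_impulse_response(gain=0.95, lag=30, n_samples=121):
    """Impulse response of 1 / (1 - gain * z^-lag): y[n] = x[n] + gain * y[n - lag]."""
    y = [0.0] * n_samples
    for n in range(n_samples):
        x_n = 1.0 if n == 0 else 0.0
        feedback = gain * y[n - lag] if n >= lag else 0.0
        y[n] = x_n + feedback
    return y

def ltp_magnitude_db(omega, gain=0.95, lag=30):
    """Magnitude response in dB at radian frequency omega (pi = Nyquist)."""
    denominator = 1.0 - gain * cmath.exp(-1j * omega * lag)
    return -20.0 * math.log10(abs(denominator))

h = ltp_impulse_response()
pulse_positions = [n for n, value in enumerate(h) if value != 0.0]

peak_db = ltp_magnitude_db(0.0)               # on a harmonic of fs/30
valley_db = ltp_magnitude_db(math.pi / 30.0)  # halfway between two harmonics
```

Under the 8 kHz sampling rate quoted for the coders later in these notes, a lag of 30 samples corresponds to a pitch of roughly 267 Hz.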



Subjective Speech Quality

Broadcast: wideband speech refers to high quality “commentary” speech at rates above 64 kbits/s.

Network or toll: quality comparable to the classical analog speech (200-3200 Hz).

Communications: somewhat degraded speech quality, but adequate for cellular communications.

Synthetic: speech that is usually intelligible but can be unnatural and associated with a loss of speaker recognizability.


The Mean Opinion Score

MOS Scale Speech Quality


1 Bad
2 Poor
3 Fair
4 Good
5 Excellent



The Mean Opinion Score (2)

The MOS range relates to speech quality as follows :

MOS 4.0 - 4.5 : network or toll quality

MOS 3.5 - 4.0 : communications quality

MOS 2.5 - 3.5 : synthetic quality

Remarks: MOS ratings may differ significantly from test to test, and hence they are not absolute measures for the comparison of different coders.
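As a small illustration of the arithmetic, the sketch below averages listener ratings on the 1 to 5 scale and maps the result to the quality ranges listed above. The function names are hypothetical, and because the listed ranges share their endpoints (3.5 and 4.0), assigning each shared endpoint to the higher class is an assumption of this sketch.

```python
def mean_opinion_score(ratings):
    """Average of integer listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if not ratings or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be a non-empty list of integers from 1 to 5")
    return sum(ratings) / len(ratings)

def quality_class(mos):
    """Map a MOS value to the quality ranges quoted in these notes."""
    if 4.0 <= mos <= 4.5:
        return "network or toll quality"
    if 3.5 <= mos < 4.0:   # shared endpoint 4.0 assigned to the higher class
        return "communications quality"
    if 2.5 <= mos < 3.5:   # shared endpoint 3.5 assigned to the higher class
        return "synthetic quality"
    return "outside the listed ranges"

score = mean_opinion_score([4, 4, 5, 3, 4])
label = quality_class(score)
```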


First Generation Analysis-by-Synthesis LPC

This class includes: IS-54 VSELP, RPE-LTP GSM, FS-1016, LD-CELP G.728, IS-96 QCELP

Mostly encode reflection coefficients or LARs

Employ, for the most part, full searches of the codebooks and LTPs

High MIPS (most of them 20 MIPS+)

Modest MOS (~ 3.5)



Code Excited Linear Prediction (CELP)
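[Figure: CELP block diagram. A codebook vector x_c(n), selected by the VQ index, is scaled by the gain g and filtered by the LTP synthesis filter (A_L(z)) and the LP synthesis filter (A(z)), then weighted by W(z) to give ŝ_w(n). The input s(n) is weighted by W(z) to give s_w(n). The error e_c(n) = s_w(n) - ŝ_w(n) drives the error minimization that selects the VQ index.]

- produced low-rate coded speech comparable to that of medium-rate waveform coders
- bridged the gap between waveform coders and vocoders
- codebook originally consisted of Gaussian sequences; 1024 vectors, each 40 samples (5 ms) long
- gain scales the excitation vector, and the excitation is filtered by the LTP and LP synthesis filters
- “optimum” vector selected such that the perceptually weighted MSE is minimized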


Code Excited Linear Prediction (2)


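The Nx1 error vector is

$$e_c(k) = s_w - \hat{s}_w^{0} - g_k\, \hat{s}_w(k)$$

where $\hat{s}_w^{0}$ is the output due to the initial filter state. Minimizing $\varepsilon_c(k) = e_c^T(k)\, e_c(k)$ w.r.t. $g_k$, we get

$$g_k = \frac{\bar{s}_w^T\, \hat{s}_w(k)}{\hat{s}_w^T(k)\, \hat{s}_w(k)}, \qquad \bar{s}_w = s_w - \hat{s}_w^{0}$$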
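The intermediate step behind the gain expression can be written out as follows. This is a sketch of the standard least-squares argument, using the shorthand $\bar{s}_w = s_w - \hat{s}_w^{0}$ for the weighted speech with the initial-state output removed (the shorthand is introduced here for convenience).

```latex
\begin{aligned}
\varepsilon_c(k) &= \left(\bar{s}_w - g_k \hat{s}_w(k)\right)^T \left(\bar{s}_w - g_k \hat{s}_w(k)\right) \\
                 &= \bar{s}_w^T \bar{s}_w - 2 g_k\, \bar{s}_w^T \hat{s}_w(k) + g_k^2\, \hat{s}_w^T(k)\, \hat{s}_w(k), \\
\frac{\partial \varepsilon_c(k)}{\partial g_k} &= -2\, \bar{s}_w^T \hat{s}_w(k) + 2 g_k\, \hat{s}_w^T(k)\, \hat{s}_w(k) = 0
\;\;\Longrightarrow\;\;
g_k = \frac{\bar{s}_w^T \hat{s}_w(k)}{\hat{s}_w^T(k)\, \hat{s}_w(k)}.
\end{aligned}
```

Because the error is quadratic in $g_k$ with a positive leading coefficient, this stationary point is the minimum.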



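Code Excited Linear Prediction (3)

$$\varepsilon_c(k) = \bar{s}_w^T \bar{s}_w - \frac{\left(\bar{s}_w^T\, \hat{s}_w(k)\right)^2}{\hat{s}_w^T(k)\, \hat{s}_w(k)}$$

Here $\bar{s}_w = s_w - \hat{s}_w^{0}$ is the weighted speech with the initial-state output removed. The k-th excitation vector, $x_c(k)$, that minimizes $\varepsilon_c(k)$ is selected.

Closed-loop analysis is used for the LTP parameters; the lag $\tau$ ranges over the integers 20 to 147.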
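The exhaustive codebook search can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the reference implementation: `target` stands for the weighted speech with the initial-state output already removed, each row of `filtered_codebook` is assumed to be a codevector already passed through the weighted synthesis filters from a zero state, and the function name is hypothetical.

```python
import numpy as np

def celp_codebook_search(target, filtered_codebook):
    """Return (index, gain, error) of the codevector with the smallest weighted error.

    target: length-N vector (weighted speech minus the initial-state output).
    filtered_codebook: K x N array; row k is the filtered k-th codevector.
    """
    target = np.asarray(target, dtype=float)
    filtered = np.asarray(filtered_codebook, dtype=float)

    cross = filtered @ target                    # correlation with the target, per k
    energy = np.sum(filtered * filtered, axis=1) # energy of each filtered codevector
    # An all-zero codevector cannot reduce the error; give it infinite energy
    # so its gain and error reduction both evaluate to zero.
    energy = np.where(energy > 0.0, energy, np.inf)

    gains = cross / energy                              # optimal gain for each k
    errors = target @ target - cross ** 2 / energy      # minimized error for each k
    best = int(np.argmin(errors))
    return best, float(gains[best]), float(errors[best])

# Toy example: 3 codevectors of length 3 (the original codebook had 1024 x 40).
index, gain, error = celp_codebook_search(
    [1.0, 2.0, 3.0],
    [[1.0, 0.0, 0.0], [2.0, 4.0, 6.0], [0.0, 1.0, 0.0]],
)
```

Minimizing the error is equivalent to maximizing the ratio of squared correlation to energy, which is why practical searches often track only that ratio.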

M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at
Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.


The IS-54 and GSM VSELP

[Figure: VSELP synthesis. The long-term filter state (selected by the lag index, gain ga) and the outputs of Codebook 1 (VQ-1 index, gain g1) and Codebook 2 (VQ-2 index, gain g2) are summed to form the excitation, which drives the synthesis filter A(z); a postfilter produces the output speech.]

- developed by Motorola; part of the IS-54 and GSM cellular standards
- speech sampled at 8 kHz, segmented in 20 ms frames with sub-frames of 5 ms
- complexity estimated at 30 MIPS; MOS 3.45

I. Gerson and M. Jasiuk, "Vector Sum Excited Linear Prediction (VSELP) Speech Coding at 8 kbits/s,"
Proc. ICASSP-90, pp. 461-464, New Mexico, Apr. 1990.

A. Spanias, M. Deisher, P. Loizou and G. Lim, "Fixed-Point Implementation of the VSELP algorithm,"
ASU-TRC Technical Report, TRC-SP-ASP-9201, July 1992.



Second Generation Analysis-by-Synthesis LPC
This class includes: G.723.1, G.729, CDMA EVRC IS-127, GSM EFR, IS-641

Encode Line Spectrum Pairs (LSP) using Split Vector Quantizers

Employ, for the most part, partial searches of the LTPs; usually an open-loop
estimate is refined by a closed-loop search around the neighborhood of the estimate

Codebooks have Algebraic structure (ACELP)

High MIPS (most of them 20 MIPS)

Provisions for channel errors

Very Good MOS (~ 3.8+)
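The two-stage pitch search mentioned above (an open-loop estimate refined by a closed-loop search in its neighborhood) might look like the following NumPy sketch. Several simplifications are assumptions of this sketch rather than features of any standard: the open-loop stage uses a plain normalized autocorrelation, the closed-loop stage compares the target directly with the delayed past excitation (no synthesis or weighting filter), the refinement radius is 3 samples, and the lag range 20 to 147 is borrowed from the CELP description earlier. The function names are hypothetical.

```python
import numpy as np

def open_loop_pitch(signal, min_lag=20, max_lag=147):
    """Coarse lag estimate: maximize the normalized autocorrelation of the signal."""
    x = np.asarray(signal, dtype=float)
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        a, b = x[lag:], x[:-lag]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        score = np.dot(a, b) / denom if denom > 0.0 else -np.inf
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def closed_loop_refine(target, past_excitation, estimate, radius=3,
                       min_lag=20, max_lag=147):
    """Refine the lag by matching the target against delayed past excitation."""
    target = np.asarray(target, dtype=float)
    past = np.asarray(past_excitation, dtype=float)
    best_lag, best_score = estimate, -np.inf
    low = max(min_lag, estimate - radius)
    high = min(max_lag, estimate + radius)
    for lag in range(low, high + 1):
        # Past excitation delayed by `lag`; np.resize repeats the segment
        # periodically when the lag is shorter than the target length.
        candidate = np.resize(past[-lag:], target.shape)
        energy = np.dot(candidate, candidate)
        if energy == 0.0:
            continue
        score = np.dot(target, candidate) ** 2 / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Toy signal: impulse train with a period of 80 samples.
x = np.zeros(400)
x[::80] = 1.0
coarse = open_loop_pitch(x)
refined = closed_loop_refine(x[320:360], x[:320], estimate=78)
```

Searching only a few lags around the open-loop estimate is what makes the search "partial": the closed-loop criterion is evaluated for 7 candidates here instead of all 128 lags.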


The CDMA IS-127 Enhanced Variable Rate Coder (EVRC) Algorithm
- It is an RCELP (Relaxed CELP) algorithm
- Differs from classical CELP in that a time-warped (downsampled) version of
the residual is matched instead of the actual speech
- Operates at 3 rates, 9.6/4.8/1.2 kbits/s, and also blank
- Unless requested by the network, the rate is determined based on voice activity
- Upon command it may generate blank or Rate 1/2
- Includes an FFT-based speech enhancement pre-processing block
- Estimated pitch at the higher rates has to conform with a pitch contour
- No pitch estimation at 1.2 kbits/s
- LPC coefficients encoded as LSPs; subframe LSPs obtained by interpolation
- The random codebook is searched using Algebraic CELP techniques
- Includes postfilters
- MOS 3.8 at 9.6 kbits/s and complexity around 20 MIPS



THE GSM ENHANCED FULL-RATE (EFR) CODER

- Bit rate 12.2 kbits/s
- Speech is sampled at 8 kHz and segmented into 20 ms frames (160 samples)
- 10 LPC parameters determined by Levinson-Durbin and vector quantized as LSPs
- Subframes are 5 ms each
- Uses an algebraic codebook
- The pitch is first estimated open loop and refined using a closed-loop search,
much like the IS-641 pitch search
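The Levinson-Durbin recursion named in the list above can be sketched in plain Python as follows. This is a generic textbook form, not code from the EFR specification: it assumes the prediction convention s(n) ≈ sum_i a_i s(n-i), takes the autocorrelation values as input, and uses a hypothetical function name. The sign convention of the reflection coefficients varies between texts.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    r: autocorrelation values r[0], ..., r[order].
    Returns (a, reflection, error), where a[i-1] is the predictor coefficient
    a_i in s(n) ~ sum_i a_i * s(n - i), reflection holds one coefficient per
    recursion stage, and error is the final prediction error energy.
    """
    a = [0.0] * order
    reflection = []
    error = r[0]
    for m in range(1, order + 1):
        # Correlation not yet explained by the order-(m-1) predictor.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / error
        reflection.append(k)
        previous = a[:]
        a[m - 1] = k
        # Update the lower-order coefficients.
        for i in range(m - 1):
            a[i] = previous[i] - k * previous[m - 2 - i]
        error *= 1.0 - k * k
    return a, reflection, error

# Toy check: a first-order process with r[k] = 0.9**k needs only a_1 = 0.9.
coeffs, refl, err = levinson_durbin([1.0, 0.9, 0.81], order=2)
```

For a coder with 10 LPC parameters, the call would use `order=10` with autocorrelation values computed from the windowed speech frame.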

GSM: Enhanced Full Rate Speech Transcoder, ETSI GSM 06.60, Nov. 1996

Vendors for EFR GSM Coder
(figures are approximate - check with the vendor for more accurate estimates)
- VLSI VWS22030 based on the DSPGroup OakDSPcore, contact VLSI Inc. (www.vlsi.com)


Third Generation Vocoders

Some Recent and Ongoing Standardization Efforts

- CDMA 2000 (supports next generation data services envisioned up to 2 Mb/s)
- GSM AMR - Adaptive Multirate Speech Coder (multiple coders)
- ITU-4 - ITU standardization efforts for 4 kb/s (ongoing)
- CDMA SMV - Selectable Mode Vocoder for the next generation CDMA



Wideband CDMA
- Objective: meet IMT 2000 requirements (at least 144 kb/s in a vehicular
environment, 384 kb/s in a pedestrian environment, and 2048 kb/s in an indoor
office environment)
- Supports next generation data services envisioned up to 2 Mb/s (full coverage
and mobility for 144 kb/s, preferably 384 kb/s; limited coverage and mobility
for 2 Mb/s)
- Enhanced voice services (audioconferencing and voice mail)
- Concurrent high-quality video/audio
- Backward compatible with IS-95B
- High security and low power
- Significantly enhanced version of EVRC for voice services

- http://www.comsoc.org/pubs/surveys/4q98issue/prasad.html
- D. Knisely et al., "Evolution of Wireless Data Services: IS-95 to CDMA 2000," IEEE Communications Magazine, pp. 140-149, October 1998.
- V. K. Garg, IS-95 CDMA and cdma2000: Cellular/PCS Systems Implementation, 1/e, Prentice Hall PTR, December 1999.


GSM Adaptive Multirate Coder

- Adjusts its bit rate according to network load
- Rates: 12.2, 10.2, 7.95, 7.4, 6.7, 5.9, 5.15, 4.75 kb/s
- Based on CELP with a 20 ms frame and 5 ms subframes
- Multirate ACELP with 10th-order short-term LPC and perceptual
weighting (uses Levinson-Durbin)
- Encodes LSPs using split VQ
- An open-loop LTP estimate is first obtained and then refined by a closed-loop search
- Highest bit rate provides toll quality; half rate provides communications
quality
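The frame arithmetic implied by these figures is easy to check. The sketch below computes the samples per frame and subframe and the number of coded bits per 20 ms frame for a few of the listed rates. The 8 kHz sampling rate is an assumption carried over from the other narrowband coders in these notes, and the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000   # assumed narrowband sampling rate
FRAME_MS = 20           # frame length from the list above
SUBFRAME_MS = 5         # subframe length from the list above

def samples_in(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in a block of the given duration."""
    return sample_rate_hz * duration_ms // 1000

def bits_per_frame(rate_kbps, frame_ms=FRAME_MS):
    """Coded bits per frame: kbit/s times ms gives bits."""
    return round(rate_kbps * frame_ms)

frame_samples = samples_in(FRAME_MS)
subframe_samples = samples_in(SUBFRAME_MS)
subframes_per_frame = FRAME_MS // SUBFRAME_MS

# A few example modes taken from the list of rates.
example_bits = {rate: bits_per_frame(rate) for rate in (12.2, 7.95, 4.75)}
```

So the highest mode spends 244 bits on each 160-sample frame, while the lowest spends 95 bits on the same frame.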

- ETSI TS 126 090 V3.1.0 (2000-01), AMR Speech Codec; Transcoding Functions (3G TS 26.090), Technical Specification
- R. Ekudden, R. Hagen, I. Johansson, and J. Svedburg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on
Speech Coding, pp. 117-119, 1999



The Selectable Mode Vocoder

• Algorithm to provide higher quality, flexibility, and capacity over the existing IS-96C, IS-127 EVRC, and IS-733 (which replaced IS-96C but works at a higher average rate)
• The Conexant SMV algorithm became the core technology for 3G CDMA (core SMV
algorithm to be refined in the interim by participating companies, according to the
publication below)
• Based on 4 codecs: full rate at 8.55 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and
eighth rate at 800 bps
• Pre-processing includes noise suppression similar to IS-127 EVRC
• Full rate and half rate based on Conexant's eXtended CELP (eX-CELP), a core
technology also used in the Conexant submission to ITU-4
• Performed better than IS-733 and IS-127 in tests with and without background noise
• Scored as high as 4.1 MOS at full rate with clean speech; performed very well with
background noise
REFERENCES:
[1] Y. Gao, E. Schlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia, “The SMV algorithm selected for TIA and 3GPP2 for CDMA applications,” conference paper by Conexant Systems (portions published at ICASSP-2001)


STANDARDS AT A GLANCE

• ITU Wideband Coding
– G.722 Coding of 7 kHz speech at 64, 56, 48 kbps - Sub-band ADPCM
– G.WB1 Coding of 7 kHz speech at 32/24 kbps - Combined Transform and CELP coding
– G.WB2 Coding of 7 kHz speech at 16 kbps or less (ongoing)

• ITU Telephony
– G.711 PCM (64 kbps) late 60’s
– G.726 ADPCM (32/40/24/16 kbps) 1988
– G.728 LD-CELP coding (16 kbps) 1992
– G.723.1 True Speech (5.3/6.3 kbps) 1995
– G.729 CS-ACELP (8/11.8/6.4 kbps) 1996 and Annex in 1998
– G.4kbps Toll quality at 4 kbps (ongoing)

• Non-ITU
– MPEG1/Audio (includes MP3), 1991
– MPEG2/Audio: 64 kbps (1992)
– MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)
– MPEG7/Audio: audio/speech/MIDI coding (ongoing)



STANDARDS AT A GLANCE (2)
• TIA
– CDMA
• IS96 8, 4, 2 kbps Q-CELP (Qualcomm CELP, 1992)
• IS127 8.55, 4, 0.8 kbps EVRC (Enhanced Variable Rate Coder, 1996)
• IS733 13.3, 6.2, 2.7, 1 kbps VRC (Variable Rate Coder, 1998)
• 3GPP2 0.8-8.55 kbps SMV (Selectable Mode Vocoder, 2001)
– TDMA
• IS54 7.95 kbps VSELP (Vector-Sum Excited Linear Prediction, 1989)
• IS641 7.4 kbps CELP (Similar to EFR but at lower rate, 1997)
– PCS1800 (GSM variant working at 1800 MHz)
• IS136-410 12.2 kbps US1 (1999)

• ETSI (GSM):
– 13 kbps RPE-LTP (Full rate GSM, 1988)
– 5.6 kbps VSELP (Half-rate GSM, 1993)
– 12.2 kbps EFR (Enhanced full-rate GSM, 1996)
– 12.2 - 4.75 kbps AMR (Adaptive Multi Rate, 1999)

• ARIB Japan
– Full-rate PDC (Personal Digital Communication) 6.7 kbps VSELP
– Half-rate PDC 3.45 kbps Multimode CELP


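Vocoder/Waveform/Hybrid

[Figure: MOS (1-5) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps). Vocoders (LPC10e, MELP) occupy the low-rate region, hybrid coders (CELP, SMV) the middle, and waveform coders (ADPCM, PCM) the high-rate region.]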



PERFORMANCE OF SOME STANDARDIZED ALGORITHMS

Algorithm                 Bit Rate (kbits/s)   MOS           Complexity (MIPS)   Frame size (ms)
PCM G.711                 64                   4.3           0.01                0
ADPCM G.726               32                   4.1           2                   0.125
SBC G.722                 48/56/64             4.1           5                   0.125
LD-CELP G.728             16                   4             ~30                 0.625
CS-ACELP G.729            8                    4             ~20                 10
CS-ACELP-A G.729          8                    3.76          11                  10
MP-MLQ G.723.1            6.3/5.3              3.98/3.7      ~16                 30
GSM FR RPE-LTP            13                   3.7 (ave)     5                   20
GSM EFR                   13                   4             14                  20
GSM HR VSELP              6.3                  ~3.4          14                  20
IS-54 VSELP               8                    3.5           14                  20
IS-641 EFR                8                    3.8           14                  20
Conexant eX-CELP SMV      8.55/4/2/0.8         ~4.1 (8.55)   ~20                 20
IS-96 QCELP               1.2/2.4/4.8/9.6      3.33 (9.6)    15                  20
IS-127 EVRC               1.2/4.8/9.6          ~3.8 (9.6)    20                  20
PDC VSELP                 6.3                  3.5           14                  20
PDC PSI-CELP              3.45                 ~3.4          ~48                 40
FS 1015 – LPC 10e         2.4                  2.3           7                   22.5
FS 1016 – CELP            4.8                  3.2           16                  30
MELP                      2.4                  3.2           ~30                 22.5
Inmarsat-B APC            9.6/12.8             ~3.1/3.4      10                  20
Inmarsat-M IMBE           6.3                  3.4           ~13                 20
