
R.J. McAulay and T.F. Quatieri

Speech Processing Based on a Sinusoidal Model

Using a sinusoidal model of speech, an analysis/synthesis technique has been developed that characterizes speech in terms of the amplitudes, frequencies, and phases of the component sine waves. These parameters can be estimated by applying a simple peak-picking algorithm to a short-time Fourier transform (STFT) of the input speech. Rapid changes in the highly resolved spectral components are tracked by using a frequency-matching algorithm and the concept of "birth" and "death" of the underlying sine waves. For a given frequency track, a cubic phase function is applied to a sine-wave generator, whose output is amplitude-modulated and added to the sine waves generated for the other frequency tracks, and this sum is the synthetic speech output. The resulting waveform preserves the general waveform shape and is essentially indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech and the noise are maintained. It was also found that high-quality reproduction was obtained for a large class of inputs: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds.

The analysis/synthesis system has become the basis for new approaches to such diverse applications as multirate coding for secure communications, time-scale and pitch-scale algorithms for speech transformations, and phase dispersion for the enhancement of AM radio broadcasts. Moreover, the technology behind the applications has been successfully transferred to private industry for commercial development.

Speech signals can be represented with a speech production model that views speech as the result of passing a glottal excitation waveform through a time-varying linear filter (Fig. 1), which models the resonant characteristics of the vocal tract. In many speech applications the glottal excitation can be assumed to be in one of two possible states, corresponding to voiced or unvoiced speech.

In attempts to design high-quality speech coders at mid-band rates, more general excitation models have been developed. Approaches that are currently popular are multipulse [1] and code-excited linear predictive coding (CELP) [2]. This paper also develops a more general model for glottal excitation, but instead of using impulses as in multipulse, or code-book excitations as in CELP, the excitation waveform is assumed to be composed of sinusoidal components of arbitrary amplitudes, frequencies, and phases.

Other approaches to analysis/synthesis that are based on sine-wave models have been discussed. Hedelin [3] proposed a pitch-independent sine-wave model for use in coding the baseband signal for speech compression. The amplitudes and frequencies of the underlying sine waves were estimated using Kalman filtering techniques, and each sine-wave phase was defined to be the integral of the associated instantaneous frequency.

Another sine-wave-based speech system is being developed by Almeida and Silva [4]. In contrast to Hedelin's approach, their system uses a pitch estimate to establish a harmonic set of sine waves. The sine-wave phases are computed from the STFT at the harmonic frequencies. To compensate for any errors that might be introduced as a result of the harmonic sine-wave representation, a residual waveform is coded, along with the underlying sine-wave parameters.

The Lincoln Laboratory Journal, Volume 1, Number 2 (1988) 153



[Fig. 1 block diagram: an impulse-train generator (driven by the pitch period) and a random-noise generator feed, through a voiced/unvoiced switch, a vocal-tract filter that produces the speech output.]

Fig. 1 - The binary model of speech production illustrated here requires pitch, vocal-tract parameters, energy levels, and voiced/unvoiced decisions as inputs.

This paper derives a sinusoidal model for the speech waveform, characterized by the amplitudes, frequencies, and phases of its component sine waves, that leads to a new analysis/synthesis technique. The glottal excitation is represented as a sum of sine waves that, when applied to a time-varying vocal-tract filter, leads to the desired sinusoidal representation for speech waveforms (Fig. 2).

A parameter-extraction algorithm has been developed that shows that the amplitudes, frequencies, and phases of the sine waves can be obtained from the high-resolution short-time Fourier transform (STFT) by locating the peaks of the associated magnitude function. To synthesize speech, the amplitudes, frequencies, and phases estimated on one frame must be matched and allowed to evolve continuously into the set of amplitudes, frequencies, and phases estimated on a successive frame. These issues are resolved using a frequency-matching algorithm in conjunction with a solution to the phase-unwrapping and phase-interpolation problem.

A system was built and experiments were performed with it. The synthetic speech was judged to be of excellent quality, essentially indistinguishable from the original. The results of some of these experiments are discussed and pictorial comparisons of the original and synthetic waveforms are presented.

The above sinusoidal transform system (STS) has found practical application in a number of speech problems. Vocoders have been developed that operate from 2.4 kbps to 4.8 kbps, providing good speech quality that increases more or less uniformly with increasing bit rate. In another area the STS provides high-quality speech transformations such as time-scale and pitch-scale modifications. Finally, a large research effort has resulted in a new technique for speech enhancement for AM radio broadcasting. These applications are considered in more detail later in the text.

[Fig. 2 block diagram: a sine-wave generator produces the excitation e(t), which drives the vocal-tract filter h(t; τ) to produce the speech s(t).]

Fig. 2 - The sinusoidal speech model consists of an excitation and vocal-tract response. The excitation waveform is characterized by the amplitudes, frequencies, and phases of the underlying sine waves of the speech; the vocal tract is modeled by the time-varying linear filter, the impulse response of which is h(t; τ).

Speech Production Model

In the speech production model, the speech waveform, s(t), is modeled as the output of a linear time-varying filter that has been excited by the glottal excitation waveform, e(t). The filter, which models the characteristics of the vocal tract, has an impulse response denoted by h(t; τ). The speech waveform is then given by

s(t) = ∫_0^t h(t - τ; t) e(τ) dτ.   (1)
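Equation 1 can be illustrated with a discrete-time sketch in which the impulse response is frozen at each output instant. This is a minimal illustration with NumPy; the slowly drifting one-pole decay below is an arbitrary stand-in for a real vocal-tract response, not the paper's model:

```python
import numpy as np

def time_varying_filter(e, h_of_m):
    """Discrete form of Eq. 1: s(m) = sum_{k=0}^{m} h_m(m - k) e(k),
    where h_m is the impulse response in effect at output sample m."""
    s = np.zeros(len(e))
    for m in range(len(e)):
        h = h_of_m(m)                     # filter frozen at output time m
        for k in range(m + 1):
            if m - k < len(h):
                s[m] += h[m - k] * e[k]   # superposition of past excitation
    return s

def h_of_m(m, length=32):
    """Hypothetical one-pole decay whose pole drifts slowly with time."""
    a = 0.9 - 0.3 * m / 200.0             # made-up slow variation
    return (1 - a) * a ** np.arange(length)

e = np.zeros(200)
e[::50] = 1.0                             # sparse glottal-like impulses
s = time_varying_filter(e, h_of_m)
```

Because the filter is re-evaluated at every output sample, each excitation impulse is shaped by the response in effect at the time it is observed, which is exactly the time-varying behavior Eq. 1 expresses.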




If the glottal excitation waveform is represented as a sum of sine waves of arbitrary amplitudes, frequencies, and phases, then the model can be written as

e(t) = Re Σ_{l=1}^{L(t)} a_l(t) exp{ j [ ∫_0^t ω_l(σ) dσ + φ_l ] }   (2)

where, for the lth sinusoidal component, a_l(t) and ω_l(t) represent the amplitude and frequency (Fig. 2). Because the sine waves will not necessarily be in phase, φ_l is included to represent a fixed phase offset. This model leads to a particularly simple representation for the speech waveform. Letting

H(ω; t) = M(ω; t) exp[ jΦ(ω; t) ]   (3)

represent the time-varying vocal-tract transfer function and assuming that the excitation parameters given in Eq. 2 are constant throughout the duration of the impulse response of the filter in effect at time t, then using Eqs. 2 and 3 in Eq. 1 leads to the speech model given by

s(t) = Re Σ_{l=1}^{L(t)} a_l(t) M[ω_l(t); t] exp{ j [ ∫_0^t ω_l(σ) dσ + Φ[ω_l(t); t] + φ_l ] }.   (4)

By combining the effects of the glottal and vocal-tract amplitudes and phases, the representation can be written more concisely as

s(t) = Re Σ_{l=1}^{L(t)} A_l(t) exp[ jψ_l(t) ]   (5)

where

A_l(t) = a_l(t) M[ω_l(t); t]   (6)

and

ψ_l(t) = ∫_0^t ω_l(σ) dσ + Φ[ω_l(t); t] + φ_l   (7)

represent the amplitude and phase of the lth sine wave along the frequency track ω_l(t). Equations 5, 6, and 7 combine to provide a sinusoidal representation of a speech waveform. In order to use the model, the amplitudes, frequencies, and phases of the component sine waves must be extracted from the original speech waveform.

Estimation of Speech Parameters

The key problem in speech analysis/synthesis is to extract from a speech waveform the parameters that represent a quasi-stationary portion of that waveform, and to use those parameters (or coded versions of them) to reconstruct an approximation that is "as close as possible" to the original speech. The parameter-extraction algorithm, or estimator, should be robust, as the parameters must often be extracted from a speech signal that has been contaminated with acoustic noise.

In general, it is difficult to determine analytically which of the component sine waves and their amplitudes, frequencies, and phases are necessary to represent a speech waveform. Therefore, an estimator based on idealized speech waveforms was developed to extract these parameters. As restrictions on the speech waveform were relaxed in order to model real speech better, adjustments were made to the estimator to accommodate these changes.

In the development of the estimator, the time axis was first broken down into an overlapping sequence of frames, each of duration T. The center of the analysis window for the kth frame occurs at time t_k. Assuming that the vocal-tract and glottal parameters are constant over an interval of time that includes the duration of the analysis window and the duration of the vocal-tract impulse response, then Eq. 7 can be written as

ψ_l(t) = (t - t_k) ω_l^k + θ_l^k   (8)

where the superscript k indicates that the parameters of the model may vary from frame to frame. Using Eq. 8 in Eq. 5, the synthetic speech waveform over frame k can be written as

s(n) = Σ_{l=1}^{L^k} γ_l^k exp( jnω_l^k )   (9)

where

γ_l^k = A_l^k exp( jθ_l^k )   (9A)

represents the complex amplitude for the lth component of the L^k sine waves.
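Over one analysis frame, Eq. 9 models the (complex) speech signal as a sum of fixed-frequency sinusoids whose real part is the waveform. A minimal sketch with NumPy; the amplitudes, frequencies, and phases below are made-up illustrative values, not measurements from real speech:

```python
import numpy as np

def synthesize_frame(amps, freqs, phases, n_samples):
    """Real part of Eq. 9: s(n) = Re sum_l gamma_l exp(j n w_l),
    with gamma_l = A_l exp(j theta_l) (Eq. 9A) and w_l in rad/sample."""
    n = np.arange(n_samples)
    gammas = amps * np.exp(1j * phases)           # complex amplitudes
    s = np.sum(gammas[:, None] * np.exp(1j * np.outer(freqs, n)), axis=0)
    return s.real

# Three hypothetical sine waves at 200, 400, 600 Hz, 8-kHz sampling
fs = 8000.0
amps = np.array([1.0, 0.5, 0.25])
freqs = 2 * np.pi * np.array([200.0, 400.0, 600.0]) / fs
phases = np.array([0.0, np.pi / 4, np.pi / 2])
frame = synthesize_frame(amps, freqs, phases, 160)  # one 20-ms frame
```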




Since the measurements are made on digitized speech, sampled-data notation [s(n)] is used. In this respect the time index n corresponds to the uniform samples of t - t_k; therefore n ranges from -N/2 to N/2, with n = 0 reset to the center of the analysis window for every frame, and where N + 1 is the duration of the analysis window. The problem now is to fit the synthetic speech waveform in Eq. 9 to the measured waveform, denoted by y(n). A useful criterion for judging the quality of fit is the mean-squared error

ε^k = Σ_n |y(n) - s(n)|²
    = Σ_n |y(n)|² - 2 Re Σ_n y(n) s*(n) + Σ_n |s(n)|².   (10)

Substituting the speech model of Eq. 9 into Eq. 10 leads to the error expression

ε^k = Σ_n |y(n)|² - 2 Re Σ_{l=1}^{L^k} (γ_l^k)* Σ_n y(n) exp(-jnω_l^k)
    + (N + 1) Σ_{l=1}^{L^k} Σ_{i=1}^{L^k} γ_l^k (γ_i^k)* sinc(ω_l^k - ω_i^k)   (11)

where sinc(x) = sin[(N + 1)x/2] / [(N + 1) sin(x/2)]. The task of the estimator is to identify a set of sine waves that minimizes Eq. 11. Insights into the development of a suitable estimator can be obtained by restricting the class of input signals to the idealization of perfectly voiced speech, ie, speech that is periodic, hence having component sine waves that are harmonically related. In this case the synthetic speech waveform can be written as

s(n) = Σ_{l=1}^{L^k} γ_l^k exp( jnlω_0^k )   (12)

where ω_0^k = 2π/T_0^k and where T_0^k is the pitch period, assumed to be constant over the duration of the kth frame. For the purpose of establishing the structure of the ideal estimator, it is further assumed that the pitch period is known and that the width of the analysis window is a multiple of T_0^k. Under these highly idealized conditions, the sinc(·) function in the last term of Eq. 11 reduces to

sinc(ω_l^k - ω_i^k) = sinc[(l - i)ω_0^k] = 1 if l = i, 0 if l ≠ i   (13)

where ω_l^k = lω_0^k. Then the error expression reduces to

ε^k = Σ_n |y(n)|² - 2(N + 1) Re Σ_{l=1}^{L^k} (γ_l^k)* Y(ω_l^k) + (N + 1) Σ_{l=1}^{L^k} |γ_l^k|²   (14)

where

Y(ω) = [1/(N + 1)] Σ_n y(n) exp(-jnω)   (15)

is the STFT of the measurement signal. By completing the square in Eq. 14, the error can be written as

ε^k = Σ_n |y(n)|² + (N + 1) Σ_{l=1}^{L^k} [ |Y(ω_l^k) - γ_l^k|² - |Y(ω_l^k)|² ]   (16)

from which it follows that the optimal estimate for the amplitude and phase is

γ̂_l^k = Y(lω_0^k)   (17)

which reduces the error to

ε^k = Σ_n |y(n)|² - (N + 1) Σ_{l=1}^{L^k} |Y(lω_0^k)|².   (18)

From this calculation it follows, therefore, that the error is minimized by selecting all of the harmonic frequencies in the speech bandwidth Ω (ie, L^k = Ω/ω_0^k).

Equations 15 and 17 completely specify the structure of the ideal estimator and show that the optimal estimator depends on the speech data through the STFT (Eq. 15). Although these results are equivalent to a Fourier-series representation of a periodic waveform, the results lead to an intuitive generalization to the more practical case. This is done by considering the function |Y(ω)|² to be a continuous function of ω. For the idealized voiced-speech case, this function (called a periodogram) will be pulse-like in nature, with peaks occurring at all of the pitch harmonics. Therefore, the frequencies of the underlying sine waves correspond to the locations of the peaks of the periodogram, and the estimates of the amplitudes and phases are obtained by evaluating the STFT at the frequencies associated with the peaks of the periodogram. This interpretation permits the extension of the estimator to a more generalized speech waveform, one that is not ideally voiced. This extension becomes evident when the STFT is calculated for the general sinusoidal speech model given in Eq. 9. In this case the STFT is simply given by

Y(ω) = Σ_{l=1}^{L^k} γ_l^k sinc(ω_l^k - ω).   (19)

Provided the analysis window is "wide enough" that

|ω_l^k - ω_i^k| ≥ 4π/(N + 1)  for l ≠ i   (20)

then the periodogram can be written as

|Y(ω)|² = Σ_{l=1}^{L^k} |γ_l^k|² sinc²(ω_l^k - ω),   (21)

and, as before, the location of the peaks of the periodogram corresponds to the underlying sine-wave frequencies. The STFT samples at these frequencies correspond to the complex amplitudes. Therefore, provided Eq. 20 holds, the structure of the ideal estimator applies to a more general class of speech waveforms than perfectly voiced speech. Since, during steady voicing, neighboring frequencies are approximately separated by the pitch frequency, Eq. 20 suggests that the desired resolution can be achieved most of the time by requiring that the analysis window be at least two pitch periods wide.

These properties are based on the assumption that the sinc(·) function is essentially zero outside of the region defined by Eq. 20. In fact, this approximation is not a valid one, because there will be sidelobes outside of this region due to the rectangular window implicit in the definition of the STFT. These sidelobes lead to leakage that compromises the performance of the estimator, a problem that is reduced, but not eliminated, by using the weighted STFT. Letting Ỹ(ω) denote the weighted STFT, ie,

Ỹ(ω) = Σ_{n=-N/2}^{N/2} w(n) y(n) exp(-jnω)   (22)

where w(n) represents the temporal weighting due to the window function, then the practical version of the idealized estimator estimates the frequencies of the underlying sine waves as the locations of the peaks of |Ỹ(ω)|. Letting these frequency estimates be denoted by {ω̂_l^k}, then the corresponding complex amplitudes are

γ̂_l^k = Ỹ(ω̂_l^k).   (23)

Assuming that the component sine waves have been properly resolved, then, in the absence of noise, Â_l^k will yield the value of an underlying sine wave, provided the window is scaled so that

Σ_{n=-N/2}^{N/2} w(n) = 1.   (24)

Using the Hamming window for the weighted STFT provided a very good sidelobe structure in that the leakage problem was eliminated; it did so at the expense of broadening the main lobes of the periodogram. In order to accommodate this broadening, the constraint implied by Eq. 20 must be revised to require that the window width be at least 2.5 times the pitch period. This revision maintains the resolution features that were needed to justify the optimality properties of the periodogram processor. Although the window width could be set on the basis of the instantaneous pitch, the analyzer is less sensitive to the performance of the pitch extractor if the window width is set on the basis of the average pitch instead. The pitch computed during strongly voiced frames is averaged using a 0.25-s time constant, and this averaged pitch is used to update, in real time, the width of the analysis window. During frames of unvoiced speech, the window is held fixed at the value obtained on the preceding voiced frame or 20 ms, whichever is smaller.
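The practical estimator just described, window normalization per Eq. 24, a weighted STFT per Eq. 22, and periodogram peak picking, can be sketched as follows. This is a minimal illustration with NumPy and a two-tone test signal in place of real speech; the function name and the 4096-point FFT (longer than the paper's 512 points, so the test tones land on exact bins) are my own choices:

```python
import numpy as np

def pick_sine_waves(y, fs, n_fft=4096, max_peaks=80):
    """Estimate sine-wave parameters for one analysis frame from the
    peaks of the Hamming-weighted periodogram (Eqs. 22-24)."""
    w = np.hamming(len(y))
    w /= w.sum()                        # Eq. 24: sum_n w(n) = 1
    Y = np.fft.rfft(w * y, n=n_fft)     # zero-padded weighted STFT (Eq. 22)
    mag = np.abs(Y)
    # local maxima of the periodogram magnitude
    pk = np.flatnonzero((mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:])) + 1
    pk = pk[np.argsort(mag[pk])[::-1]][:max_peaks]   # keep the largest peaks
    freqs = pk * fs / n_fft             # Hz
    amps = 2 * mag[pk]                  # a real sine splits across +/- freqs
    # note: phases here are referenced to the frame start (n = 0),
    # not to the window center as in the paper
    phases = np.angle(Y[pk])
    return freqs, amps, phases

fs = 8000.0
t = np.arange(400) / fs                 # one 50-ms frame
y = 1.0 * np.cos(2 * np.pi * 500 * t) + 0.5 * np.cos(2 * np.pi * 1250 * t)
freqs, amps, phases = pick_sine_waves(y, fs)
```

Because the window is scaled so its samples sum to one, the peak magnitude of each resolved component directly yields half the sine-wave amplitude, which is why the factor of 2 appears above.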




Once the width for a particular frame has been specified, the Hamming window is computed and normalized according to Eq. 24, and the STFT of the input speech is taken using a 512-point fast Fourier transform (FFT) for 4-kHz bandwidth speech. A typical periodogram for voiced speech, along with the amplitudes and frequencies that are estimated using the above procedure, is plotted in Fig. 3.

In order to apply the sinusoidal model to unvoiced speech, the frequencies corresponding to the periodogram peaks must be close enough to satisfy the requirement imposed by the Karhunen-Loève expansion [5] for noiselike signals. If the window width is constrained to be at least 20 ms wide, on average, the corresponding periodogram peaks will be approximately 100 Hz apart, enough to satisfy the constraints of the Karhunen-Loève sinusoidal representation for random noise. A typical periodogram for a frame of unvoiced speech, along with the estimated amplitudes and frequencies, is plotted in Fig. 4.

[Fig. 3 plot: periodogram of a voiced speech frame, 0-4 kHz, with the estimated sine-wave amplitudes and frequencies marked.]

Fig. 3 - This is a typical periodogram of voiced speech. The amplitude peaks of the periodogram determine which frequencies are chosen to represent the speech waveform. The amplitudes of the underlying sine waves are denoted with an x.

[Fig. 4 plot: periodogram of an unvoiced speech frame, 0-4 kHz, with the estimated sine-wave amplitudes and frequencies marked.]

Fig. 4 - This periodogram illustrates how the power is shifted to higher frequencies in unvoiced speech. The amplitudes of the underlying sine waves are denoted with an x.

This analysis provides a justification for representing speech waveforms in terms of the amplitudes, frequencies, and phases of a set of sine waves. However, each sine-wave representation applies only to one analysis frame; different sets of these parameters are obtained for each frame. The next problem to address, then, is how to associate the amplitudes, frequencies, and phases measured on one frame with those found on a successive frame.

Frame-to-Frame Peak Matching

If the number of periodogram peaks were constant from frame to frame, the peaks could simply be matched between frames on a frequency-ordered basis. In practice, however, there are spurious peaks that come and go because of the effects of sidelobe interaction. (The Hamming window doesn't completely eliminate sidelobe interaction.) Additionally, peak locations change as the pitch changes, and there are rapid changes in both the location and the number of peaks corresponding to rapidly varying regions of speech, such as at voiced/unvoiced transitions. The analysis system can accommodate these rapid changes through the incorporation of a nearest-neighbor frequency tracker and the concept of the "birth" and



"death" of sinusoidal components. A detailed Frequency


description of this tracking algorithm is given in
Ref.6.
An illustration of the effects of the procedure
used to account for extraneous peaks is shown
in Fig. 5. The results of applying the tracker to a
segment of real speech is shown in Fig. 6, which
demonstrates the ability of the tracker to adapt
qUickly through such transitory speech behav-
ior as voiced/unvoiced transitions and mixed
voiced/unvoiced regions.
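The nearest-neighbor matching with births and deaths can be sketched as a greedy pass over the peak frequencies of two adjacent frames. This is my own simplification of the tracker detailed in Ref. 6, and the 50-Hz matching tolerance is a made-up value:

```python
def match_frames(freqs_a, freqs_b, delta=50.0):
    """Greedily match each frequency in frame a to its nearest unclaimed
    neighbor in frame b. Unmatched entries of a "die"; unmatched entries
    of b are "born". Returns (pairs, dead_indices, born_indices)."""
    pairs, dead, claimed = [], [], set()
    for i, fa in enumerate(freqs_a):
        best, best_j = delta, None
        for j, fb in enumerate(freqs_b):
            if j not in claimed and abs(fb - fa) < best:
                best, best_j = abs(fb - fa), j   # nearest candidate so far
        if best_j is None:
            dead.append(i)                       # track dies at this boundary
        else:
            claimed.add(best_j)
            pairs.append((i, best_j))
    born = [j for j in range(len(freqs_b)) if j not in claimed]
    return pairs, dead, born

pairs, dead, born = match_frames([200.0, 410.0, 600.0],
                                 [205.0, 400.0, 800.0])
```

Here the 600-Hz peak finds no neighbor within the tolerance and dies, while the new 800-Hz peak is born; a production tracker would also resolve contention between tracks rather than matching greedily in index order.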

The Synthesis System

After the execution of the preceding parameter-extraction and peak-matching procedures, the information captured in those steps can be used in conjunction with a synthesis system to produce natural-sounding speech. Since a set of amplitudes, frequencies, and phases is estimated for each frame, it might seem reasonable to estimate the original speech waveform on the kth frame by using the equation

s(n) = Σ_{l=1}^{L^k} A_l^k cos( nω_l^k + θ_l^k )   (25)

where n = 0, 1, 2, ..., S - 1 and where S is the length of the synthesis frame. Because of the time-varying nature of the parameters, however, this straightforward approach leads to discontinuities at the frame boundaries. The discontinuities seriously degrade the quality of the synthetic speech. To circumvent this problem, the parameters must be smoothly interpolated from frame to frame.

[Fig. 5 diagram: frequency tracks evolving over several frames, with tracks dying and a new track being born.]

Fig. 5 - Different modes used in the birth/death frequency-track-matching process. Note the death of two tracks during frames one and three and the birth of a track during the second frame.

[Fig. 6 plot: frequency tracks vs. time for a segment of real speech spanning an unvoiced segment and a voiced segment.]

Fig. 6 - These frequency tracks were derived from real speech using the birth/death frequency tracker.

The most straightforward approach for performing this interpolation is to overlap and add time-weighted segments of the sinusoidal components. This process uses the measured amplitude, frequency, and phase (referenced to the center of the synthesis frame) to construct a sine wave, which is then weighted by a triangular window over a duration equal to twice the length of the synthesis frame. The time-weighted components corresponding to the lagging edge of the triangular window are added to the overlapping leading-edge components that were generated during the previous frame. Real-time systems using this technique were operated with frames separated by 11.5 ms and 20.0 ms, respectively. While the synthetic speech produced by the first system was quite good, and almost indistinguishable from the original speech, the longer frame interval produced synthetic speech that was rough and, though quite intelligible, deemed to be of poor quality. Therefore, the overlap-add synthesizer is useful only for applications that can support a high frame rate. But there are many practical applications, such as speech coding, that require lower frame rates. These applications require an alternative to the overlap-add interpolation scheme.

A method will now be described that interpolates the matched sine-wave parameters directly. Since the frequency-matching algorithm associates all of the parameters measured for an arbitrary frame, k, with a corresponding set of parameters for frame k + 1, then letting

(A_l^k, ω_l^k, θ_l^k) and (A_l^{k+1}, ω_l^{k+1}, θ_l^{k+1})   (25A)

denote the successive sets of parameters for the lth frequency track, a solution to the amplitude-interpolation problem is to take

Â(n) = A^k + [(A^{k+1} - A^k)/S] n   (26)

where n = 0, 1, ..., S - 1 is the time sample into the kth frame. (The track subscript l has been omitted for convenience.)

Unfortunately, this simple approach cannot be used to interpolate the frequency and phase because the measured phase, θ^k, is obtained modulo 2π (Fig. 7). Hence, phase unwrapping must be performed to ensure that the frequency tracks are "maximally smooth" across frame boundaries. One solution to this problem, which uses a phase-interpolation function that is a cubic polynomial, has been developed in Ref. 6. The phase-unwrapping procedure provides each frequency track with an instantaneous unwrapped phase such that the frequencies and phases at the frame boundaries are consistent with the measured values modulo 2π. The unwrapped phase accounts for both the rapid phase changes due to the frequency of each sinusoidal component and the slowly varying phase changes due to the glottal pulse and the vocal-tract transfer function.

Letting θ̃_l(n) denote the unwrapped phase function for the lth track, the final synthetic waveform will be given by

s(n) = Σ_{l=1}^{L^k} A_l(n) cos[ θ̃_l(n) ]   (27)

where A_l(n) is given by Eq. 26, θ̃_l(n) is given by the cubic phase-interpolation function

θ̃_l(n) = ζ_l + γ_l n + α_l n² + β_l n³   (28)

and L^k is the number of sine waves estimated for the kth frame.

[Fig. 7 plots of phase vs. time for cases (a) and (b).]

Fig. 7 - Requirement for unwrapped phase. (a) Interpolation of wrapped phase. (b) Interpolation of unwrapped phase.

Experimental Results

Figure 8 gives a block diagram description of the complete analysis/synthesis system. A non-real-time floating-point simulation was developed initially to determine the effectiveness of the proposed approach in modeling real speech. The speech processed in the simulation was low-pass-filtered at 5 kHz, digitized at 10 kHz, and analyzed at 10-ms frame intervals. A 512-point FFT using a pitch-adaptive Hamming window, with a width 2.5 times the average pitch, was used and found to be sufficient for accurate peak estimation.




[Fig. 8 block diagram. Analysis: the windowed speech input is transformed with a DFT; peak picking yields the amplitudes and frequencies, with the phases obtained through tan^-1(·). Synthesis: the phases and frequencies undergo frame-to-frame phase unwrapping and interpolation to drive a sine-wave generator, the amplitudes undergo frame-to-frame linear interpolation, and all sine waves are summed to form the synthetic speech output.]

Fig. 8 - This block diagram of the sinusoidal analysis/synthesis system illustrates the major functions subsumed within the system. Neither voicing decisions nor residual waveforms are required for speech synthesis.
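The synthesis-side interpolation, linear amplitude interpolation (Eq. 26) plus the cubic phase function (Eq. 28), can be sketched per frequency track as follows. NumPy is assumed, the endpoint parameters are made-up test values, and the closed form used to pick the phase-unwrapping multiplier M is the "maximally smooth" solution as I recall it from Ref. 6, so it should be checked against that derivation:

```python
import numpy as np

def interpolate_track(A0, A1, w0, w1, th0, th1, S):
    """Synthesize one track over a frame of S samples: linear amplitude
    (Eq. 26) and a cubic phase (Eq. 28) that meets the measured phases
    and frequencies at both frame boundaries (modulo 2 pi)."""
    n = np.arange(S)
    A = A0 + (A1 - A0) * n / S                   # Eq. 26
    # unwrapping multiplier from the maximally smooth criterion (Ref. 6)
    M = np.round(((th0 + w0 * S - th1) + (w1 - w0) * S / 2) / (2 * np.pi))
    dth = th1 + 2 * np.pi * M - th0 - w0 * S     # unwrapped phase increment
    dw = w1 - w0
    alpha = 3 * dth / S**2 - dw / S              # cubic coefficients chosen so
    beta = -2 * dth / S**3 + dw / S**2           # theta(S)=th1+2*pi*M, theta'(S)=w1
    theta = th0 + w0 * n + alpha * n**2 + beta * n**3   # Eq. 28
    return A * np.cos(theta)                     # one term of Eq. 27

# one track drifting from 0.20 to 0.22 rad/sample over a 100-sample frame
seg = interpolate_track(A0=1.0, A1=0.8, w0=0.2, w1=0.22,
                        th0=0.5, th1=(0.5 + 0.21 * 100) % (2 * np.pi),
                        S=100)
```

With these endpoint constraints the phase and its derivative match the measured values at both boundaries, so concatenated frames have no amplitude or frequency discontinuities.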

The maximum number of peaks used in synthesis was set to a fixed number (about 80); if excess peaks were obtained, only the largest peaks were used.

A large speech data base was processed with this system, and the synthetic speech was essentially indistinguishable from the original. A visual examination of the reconstructed passages shows that the waveform structure is essentially preserved. This is illustrated by Fig. 9, which compares the waveforms of the original speech and the reconstructed speech during an unvoiced/voiced speech transition. The comparison suggests that the quasi-stationary conditions imposed on the speech model are met and that the use of the parametric model based on the amplitudes, frequencies, and phases of a set of sine-wave components is justified for both voiced and unvoiced speech.

In another set of experiments it was found that the system was capable of synthesizing a broad class of signals including multispeaker waveforms, music, speech in a music background, and marine biologic signals such as whale sounds. Furthermore, the reconstruction does not break down in the presence of noise. The synthesized speech is perceptually indistinguishable from the original noisy speech, with essentially no modification of the noise characteristics. Illustrations depicting the performance of the system in the face of the above degradations are provided in Ref. 6. More recently a real-time system has been completed using the Analog Devices ADSP2100 16-bit fixed-point signal-processing chips, and performance equal to that of the simulation has been achieved.

Although high-quality analysis/synthesis of speech has been demonstrated using the amplitudes, frequencies, and phases of the peaks



of the high-resolution STFT, it is often argued that the ear is insensitive to phase, a proposition that forms the basis of much of the work in narrowband speech coders. The question arises whether or not the phase measurements are essential to the sine-wave synthesis procedure. An attempt to explore this question was made by replacing each cubic phase track by a phase function that was defined to be the integral of the instantaneous frequency [3,7]. In this case the instantaneous frequency was taken to be the linear interpolation of the frequencies measured at the frame boundaries; the integration started from a zero value at the birth of the track and continued to be evaluated along that track until the track died. This "magnitude-only" reconstruction technique was applied to several sentences of speech, and, while the resulting synthetic speech was very intelligible and free of artifacts, it was perceived as being different from the original speech, having a somewhat mechanical quality. Furthermore, the differences were more pronounced for low-pitched (ie, pitch less than about 100 Hz) speakers. An example of a waveform synthesized by the magnitude-only system is shown in Fig. 10(b). Compared to the original speech, shown in Fig. 10(a), the synthetic waveform is quite different because of the failure to maintain the true sine-wave phases. In an additional experiment the magnitude-only system was applied to the synthesis of noisy speech; the synthetic noise took on a tonal quality that was unnatural and annoying.

[Fig. 9 waveform plots: (a) original speech; (b) reconstructed speech; 20-ms scale bars.]

Fig. 9 - (a) Original speech. (b) Reconstructed speech. Both voiced and unvoiced segments are compared to the voiced/unvoiced segment of speech that has been reconstructed with the sinusoidal analysis/synthesis system. The waveforms are nearly identical, justifying the use of the sinusoidal transform model.

[Fig. 10 waveform plots: (a) original speech; (b) speech reconstructed by the magnitude-only system; 20-ms scale bars.]

Fig. 10 - (a) Original speech. (b) Reconstructed speech. The original speech is compared to the segment that has been reconstructed using the magnitude-only system. The differences brought about by ignoring the true sine-wave phases are clearly demonstrated.

Vocoding

Since the parameters of the sinusoidal speech model are the amplitudes, frequencies, and phases of the underlying sine waves, and since, for a typical low-pitched speaker, there can be as many as 80 sine waves in a 4-kHz speech bandwidth, it isn't possible to code all of the parameters directly. Attempts at parameter reduction use an estimate of the pitch to establish a set of harmonically related sine waves that provide a best fit to the input speech waveform.



The amplitudes, frequencies, and phases of this reduced set of sine waves are then coded. Since the neighboring sine-wave amplitudes are naturally correlated, especially for low-pitched speakers, pulse-code modulation is used to encode the differential log amplitudes.

In the earlier versions of the vocoder all the sine-wave amplitudes were coded simply by allocating the available bits to the number to be coded. Since a low-pitched speaker can produce as many as 80 sine waves, in the limit of 1 bit/sine-wave amplitude 4000 bps would be required at a 50-Hz frame rate. For an 8-kbps channel, this leaves 4.0 kbps for coding the pitch, energy, and about 12 baseband phases. However, at 4.8 kbps and below, assigning 1 bit/amplitude immediately exhausts the coding budget, so no phases can be coded. Therefore, a more efficient amplitude encoder had to be developed for operation at these lower rates.

The increased efficiency is obtained by allowing the channel separation to increase logarithmically with frequency, thereby exploiting the critical-band properties of the ear. Rather than implement a set of bandpass filters to obtain the channel amplitudes, as is done in the channel vocoder, an envelope of the sine-wave amplitudes is constructed by linearly interpolating between sine-wave peaks and sampling at the critical-band frequencies. Moreover, the location of the critical-band frequencies was made pitch-adaptive, whereby the baseband samples are linearly separated by the pitch frequency, with the log-spacing applied to the high-frequency channels as required.

In order to preserve naturalness at rates of 4.8 kbps and below, a synthetic phase model was employed that phase-locks all of the sine waves to the fundamental and adds a pitch-dependent quadratic phase dispersion and a voicing-dependent random phase to each sine wave [8]. Using this technique at 4.8 kbps, the synthesized speech achieved a diagnostic rhyme test (DRT) score of 94 (for three male speakers). A real-time, 4800-bps system that uses two ADSP2100 signal-processing chips has been successfully implemented. The technology developed from this research has been transferred to the commercial sector (CYLINK, Inc.), where a product is being produced that will be available in 1989.

Transformations

The goal of time-scale modification is to maintain the perceptual quality of the original speech while changing the apparent rate of articulation. This implies that the frequency trajectories of the excitation (and thus the pitch contour) are stretched or compressed in time and that the vocal tract changes at the modified rate. To achieve these rate changes, the system amplitudes and phases, and the excitation amplitudes and frequencies, along each frequency track are time-scaled. Since the parameter estimates of the unmodified synthesis are available as continuous functions of time, in theory any rate change is possible. Rate changes ranging between a compression of two and an expansion of two have been implemented with good results. Furthermore, the natural quality and smoothness of the original speech were preserved through transitions such as voiced/unvoiced boundaries. Besides the above constant rate changes, linearly varying and oscillatory rate changes have been applied to synthetic speech, resulting in natural-sounding speech that is free of artifacts [9].

Since the synthesis procedure consists of summing the sinusoidal waveforms for each of the measured frequencies, the procedure is ideally suited to performing various frequency transformations. The procedure has been employed to warp the short-time spectral envelope and pitch contour of the speech waveform and, conversely, to alter the pitch while preserving the short-time spectral envelope. These speech transformations can be applied simultaneously, so that time- and frequency (or pitch)-scaling can occur together by simultaneously stretching and shifting frequency tracks. These joint operations can be performed with a continuously adjustable rate change.

Audio Preprocessing

The problem of preprocessing speech that is to be degraded by natural or man-made disturbances arises in applications such as attempts to increase the coverage area of AM radio broadcasts and in improving ground-to-air communications in a noisy cockpit environment. The transmitter in such cases is usually constrained by a peak operating power, or the dynamic range of the system is limited by the sensitivity characteristics of the receiver or by ambient-noise levels. Under these constraints, phase dispersion and dynamic-range compression, along with spectral shaping (pre-emphasis), are combined to reduce the peak-to-rms ratio of the transmitted signal. In this way, more average power can be broadcast without changing the peak-power output of the transmitter, thus increasing loudness and intelligibility at the receiver while adhering to peak-output power constraints.

This problem is similar to one in radar in which the signal is periodic and given as the output of a transmit filter whose input consists of periodic pulses. The spectral magnitude of the filter is specified, and the phase of the filter is chosen so that, over one period, the signal is frequency-modulated (a linear frequency modulation is usually employed) and has a flat envelope over some specified duration. If there is a peak-power limitation on the transmitter, this approach imparts maximal energy to the waveform. In the case of speech, the voiced-speech signal is approximately periodic and can be modeled as the output of a vocal-tract filter whose input is a pulse train. The important difference between the radar design problem and the speech-audio preprocessing problem is that, in the case of speech, there is no control over the magnitude and phase of the vocal-tract filter. The vocal-tract filter is characterized by a spectral magnitude and some natural phase dispersion. Thus, in order to take advantage of the radar signal-design solutions to dispersion, the natural phase dispersion must first be estimated and removed from the speech signal. The desired phase can then be introduced. All three of these operations, the estimation of the natural phase, the removal of this phase, and its replacement with the desired phase, have been implemented using the STS.

The dispersive phase introduced into the speech waveform is derived from the measured vocal-tract spectral envelope and a pitch-period estimate that changes as a function of time; the dispersive solution thus adapts to the time-varying characteristics of the speech waveform. This phase solution is sampled at the sine-wave frequencies and linearly interpolated across frames to form the system phase component of each sine wave [10].

This approach lends itself to coupling phase dispersion, dynamic-range compression, and pre-emphasis via the STS. An example of the application of the combined operations on a speech waveform is shown in Fig. 11. A system using these techniques produced an 8-dB reduction in peak-to-rms level, about 3 dB better than commercial processors. This work is currently being extended for further reduction of peak-to-rms level and for further improvement of quality and robustness. Overall system performance will be evaluated by field tests conducted by Voice of America over representative transmitter and receiver links.

Fig. 11. Audio preprocessing using phase dispersion and dynamic-range compression illustrates the reduction in the peak-to-rms level.

Conclusions

A sinusoidal representation for the speech waveform has been developed; it extracts the amplitudes, frequencies, and phases of the component sine waves from the short-time Fourier transform. In order to account for spurious effects due to sidelobe interaction and time-varying voicing and vocal-tract events, sine waves are allowed to come and go in accordance with a birth/death frequency-tracking algorithm. Once contiguous frequencies are matched, a maximally smooth phase-interpolation function is obtained that is consistent with all of the frequency and phase measurements. This phase function is applied to a sine-wave generator, which is amplitude-modulated and added to the other sine waves to form the output speech. It is important to note that, except in updating the average pitch (used to adjust the width of the analysis window), no voicing decisions are used in the analysis/synthesis procedure.

In some respects the basic model has similarities to one that Flanagan has proposed [11, 12]. Flanagan argues that, because of the nature of the peripheral auditory system, a speech waveform can be expressed as the sum of the outputs of a fixed filter bank. The amplitude, frequency, and phase measurements of the filter outputs are then used in various configurations of speech synthesizers. Although the present work is based on the discrete Fourier transform (DFT), which can be interpreted as a filter bank, the use of a high-resolution DFT in combination with peak picking renders a highly adaptive filter bank, since only a subset of all of the DFT filters is used at any one frame. It is the use of the frequency tracker and the phase interpolator that allows the filter bank to move with the highly resolved speech components. Therefore, the system fits into the framework Flanagan described but, whereas Flanagan's approach is based on the properties of the peripheral auditory system, the present system is designed on the basis of properties of the speech production mechanism.

Attempts to perform magnitude-only reconstruction were made by replacing the cubic phase tracks with a phase that was simply the integral of the instantaneous frequency. While the resulting speech was very intelligible and free of artifacts, it was perceived as being different in quality from the original speech; the differences were more pronounced for low-pitched (i.e., pitch < ~100 Hz) speakers. When the magnitude-only system was used to synthesize noisy speech, the synthetic noise took on a tonal quality that was unnatural and annoying. It was concluded that this latter property would render the system unsuitable for applications in which the speech would be subjected to additive acoustic noise.

While it may be tempting to conclude that the ear is not phase-deaf, particularly for low-pitched speakers, it may be that this is simply a property of the sinusoidal analysis/synthesis system. No attempts were made to devise an experiment that would resolve this question conclusively. It was felt, however, that the system was well suited to the design and execution of such an experiment, since it provides explicit access to a set of phase parameters that are essential to the high-quality reconstruction of speech.

Using the frequency tracker and the cubic phase-interpolation function resulted in a functional description of the time evolution of the amplitude and phase of the sinusoidal components of the synthetic speech. For time-scale, pitch-scale, and frequency modification of speech, and for speech-coding applications, such a functional model is essential. However, if the system is used merely to produce synthetic speech using a set of sine waves, then the frequency-tracking and phase-interpolation procedures are unnecessary. In this case, the interpolation is achieved by overlapping and adding time-weighted segments of each of the sinusoidal components. The resulting synthetic speech is essentially indistinguishable from the original speech as long as the frame rate is at least 100 Hz.

A fixed-point 16-bit real-time implementation of the system has been developed on the Lincoln Digital Signal Processors [13] and using the Analog Devices ADSP2100 processors. Diagnostic rhyme tests have been performed, and about one DRT point is lost relative to the unprocessed speech of the same bandwidth with the analysis/synthesis system operating at a 50-Hz frame rate. The system is used in research aimed at the development of a multirate speech coder [14]. A practical low-rate coder has been developed at 4800 bps and 2400 bps using two commercially available DSP chips. Furthermore, the resulting technology has been transferred to private industry for commercial development. The sinusoidal analysis/synthesis system has also been applied successfully to problems in time-scale, pitch-scale, and frequency modification of speech [9] and to the problem of speech enhancement for AM radio broadcasting [10].

Acknowledgments

The authors would like to thank their colleague, Joseph Tierney, for his comments and suggestions in the early stages of this work. The authors would also like to acknowledge the support and encouragement of the late Anton Segota of RADC/EEV, who was the Air Force program manager for this research from its inception in September 1983 until his death in March 1987.

This work was sponsored by the Department of the Air Force.
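As a concrete illustration of the time-scale modification described in the Transformations section, a single frequency track's amplitude and frequency trajectories can be stretched in time and the phase re-integrated from the stretched frequencies. This is a simplified sketch under stated assumptions (one track, linear interpolation, phase by running sum), not the actual system, which interpolates cubic phase functions and sums many tracks:

```python
import numpy as np

def time_scale_track(amp, freq, fs, rate):
    """Time-scale one sine-wave track by `rate` (rate > 1 slows the
    apparent articulation). `amp` and `freq` (Hz) are the track's
    amplitude and frequency trajectories, one value per sample."""
    n_out = int(len(amp) * rate)
    src = np.arange(n_out) / rate        # stretched time axis
    # Stretch the amplitude and frequency trajectories in time ...
    amp_s = np.interp(src, np.arange(len(amp)), amp)
    freq_s = np.interp(src, np.arange(len(freq)), freq)
    # ... then re-integrate the stretched frequencies, so the pitch
    # is unchanged while the rate of articulation is modified.
    phase = 2 * np.pi * np.cumsum(freq_s) / fs
    return amp_s * np.cos(phase)

fs = 8000
amp = np.ones(fs)                        # one second, unit amplitude
freq = np.full(fs, 200.0)                # a steady 200-Hz track
slow = time_scale_track(amp, freq, fs, rate=2.0)
# Output is twice as long but still oscillates at 200 Hz.
```

Because the stretching acts on the parameter trajectories rather than on the waveform samples, the pitch contour keeps its values while its time axis expands, which is exactly the behavior the section describes.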




References

1. B.S. Atal and J.R. Remde, "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates," Int. Conf. on Acoustics, Speech, and Signal Processing 82 (IEEE, New York, 1982), p. 614.
2. M.R. Schroeder and B.S. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates," Int. Conf. on Acoustics, Speech, and Signal Processing 85 (IEEE, New York, 1985), p. 937.
3. P. Hedelin, "A Tone-Oriented Voice-Excited Vocoder," Int. Conf. on Acoustics, Speech, and Signal Processing 81 (IEEE, New York, 1981), p. 205.
4. L.B. Almeida and F.M. Silva, "Variable-Frequency Synthesis: An Improved Harmonic Coding Scheme," Int. Conf. on Acoustics, Speech, and Signal Processing 84 (IEEE, New York, 1984), p. 27.5.1.
5. H. Van Trees, Detection, Estimation and Modulation Theory, Part I (John Wiley, New York, 1968), Chap. 3.
6. R.J. McAulay and T.F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. Acoust. Speech Signal Process. ASSP-34, 744 (1986).
7. R.J. McAulay and T.F. Quatieri, "Magnitude-Only Reconstruction Using a Sinusoidal Speech Model," Int. Conf. on Acoustics, Speech, and Signal Processing 84 (IEEE, New York, 1984), p. 27.6.1.
8. R.J. McAulay and T.F. Quatieri, "Multirate Sinusoidal Transform Coding at Rates from 2.4 kbps to 8.0 kbps," Int. Conf. on Acoustics, Speech, and Signal Processing 87 (IEEE, New York, 1987), p. 1645.
9. T.F. Quatieri and R.J. McAulay, "Speech Transformations Based on a Sinusoidal Representation," IEEE Trans. Acoust. Speech Signal Process. ASSP-34, 1449 (Dec. 1986).
10. T.F. Quatieri and R.J. McAulay, "Sinewave-Based Phase Dispersion for Audio Preprocessing," Int. Conf. on Acoustics, Speech, and Signal Processing 88 (IEEE, New York, 1988), p. 2558.
11. J.L. Flanagan, "Parametric Coding of Speech Spectra," J. Acoust. Soc. of America 68, 412 (1980).
12. J.L. Flanagan and S.W. Christensen, "Computer Studies on Parametric Coding of Speech Spectra," J. Acoust. Soc. of America 68, 420 (1980).
13. P.E. Blankenship, "LDVT: High Performance Minicomputer for Real-Time Speech Processing," EASCON '75 (IEEE, New York, 1975), p. 214-A.
14. R.J. McAulay and T.F. Quatieri, "Mid-Rate Coding Based on a Sinusoidal Representation of Speech," Int. Conf. on Acoustics, Speech, and Signal Processing 85 (IEEE, New York, 1985), p. 945.




ROBERT J. McAULAY is a senior staff member in the Speech Systems Technology Group. He received a BASc degree in engineering physics (with Honors) from the University of Toronto in 1962, an MSc degree in electrical engineering from the University of Illinois in 1963, and a PhD degree in electrical engineering from the University of California, Berkeley, in 1967. In 1967 he joined Lincoln Laboratory and worked on problems in estimation theory and signal/filter design using optimal control techniques. From 1970 to 1975 he was a member of the Air Traffic Control Division and worked on the development of aircraft tracking algorithms. Since 1975 he has been involved with the development of robust narrow-band speech vocoders. Bob received the M. Barry Carlton award in 1978 for the best paper published in the IEEE Transactions on Aerospace and Electronic Systems. In 1987 he was a member of the panel on Removal of Noise from a Speech/Noise Signal for Bioacoustics and Biomechanics of the National Research Council's Committee on Hearing.

THOMAS F. QUATIERI is a staff member in the Speech Systems Technology Group, where he is working on problems in digital signal processing and their application to speech enhancement, speech coding, and data communications. He received a BS degree (summa cum laude) from Tufts University and SM, EE, and ScD degrees from the Massachusetts Institute of Technology in 1975, 1977, and 1979, respectively. He was previously a member of the Sensor Processing Technology Group involved in multidimensional digital signal processing and image processing. Tom received the 1982 Paper Award of the IEEE Acoustics, Speech and Signal Processing Society for the best paper by an author under thirty years of age. He is a member of the IEEE Digital Signal Processing Technical Committee, Tau Beta Pi, Eta Kappa Nu, and Sigma Xi.

