Quatieri
The above sinusoidal transform system (STS) has found practical application in a number of speech problems. Vocoders have been developed that operate from 2.4 kbps to 4.8 kbps, providing good speech quality that increases more or less uniformly with increasing bit rate. In another area the STS provides high-quality speech transformations such as time-scale and pitch-scale modifications. Finally, a large research effort has resulted in a new technique for speech enhancement for AM radio broadcasting. These applications are considered in more detail later in the text.

This paper derives a sinusoidal model for the speech waveform, characterized by the amplitudes, frequencies, and phases of its component sine waves, that leads to a new analysis/synthesis technique. The glottal excitation is represented as a sum of sine waves that, when applied to a time-varying vocal-tract filter, leads to the desired sinusoidal representation for speech waveforms (Fig. 2).

[Fig. 2 - The sinusoidal speech model consists of an excitation and a vocal-tract response: an impulse-train generator and a random-noise generator feed sine-wave generation to produce the excitation e(t), which drives the vocal-tract filter to produce the speech s(t). The excitation waveform is characterized by the amplitudes, frequencies, and phases of the underlying sine waves of the speech; the vocal tract is modeled by the time-varying linear filter, the impulse response of which is h(t; τ).]

A parameter-extraction algorithm has been developed that shows that the amplitudes, frequencies, and phases of the sine waves can be obtained from the high-resolution short-time Fourier transform (STFT) by locating the peaks of the associated magnitude function. To synthesize speech, the amplitudes, frequencies, and phases estimated on one frame must be matched and allowed to evolve continuously into the set of amplitudes, frequencies, and phases estimated on a successive frame. These issues are resolved using a frequency-matching algorithm in conjunction with a solution to the phase-unwrapping and phase-interpolation problem.

A system was built and experiments were performed with it. The synthetic speech was judged to be of excellent quality, essentially indistinguishable from the original.

Speech Production Model

In the speech production model, the speech waveform, s(t), is modeled as the output of a linear time-varying filter that has been excited by the glottal excitation waveform, e(t). The filter, which models the characteristics of the vocal tract, has an impulse response denoted by h(t; τ). The speech waveform is then given by

    s(t) = ∫₀ᵗ h(t − τ; t) e(τ) dτ .   (1)

If the glottal excitation waveform is represented as a sum of sine waves of arbitrary amplitudes, frequencies, and phases, then the model can be written as

    e(t) = Re Σ_{l=1}^{L(t)} a_l(t) exp{j[∫₀ᵗ ω_l(σ) dσ + φ_l]}   (2)

where, for the l-th sinusoidal component, a_l(t) and ω_l(t) represent the amplitude and frequency (Fig. 2). Because the sine waves will not necessarily be in phase, φ_l is included to represent a fixed phase offset. This model leads to a particularly simple representation for the speech waveform. Letting

    H(ω; t) = M(ω; t) exp[jΦ(ω; t)]   (3)

represent the time-varying vocal-tract transfer function, and assuming that the excitation parameters given in Eq. 2 are constant throughout the duration of the impulse response of the filter in effect at time t, then using Eqs. 2 and 3 in Eq. 1 leads to the speech model given by

    s(t) = Re Σ_{l=1}^{L(t)} a_l(t) M[ω_l(t); t] exp{j[∫₀ᵗ ω_l(σ) dσ + Φ(ω_l(t); t) + φ_l]} .   (4)

By combining the effects of the glottal and vocal-tract amplitudes and phases, the representation can be written more concisely as

    s(t) = Σ_{l=1}^{L(t)} A_l(t) exp[jψ_l(t)]   (5)

where

    A_l(t) = a_l(t) M[ω_l(t); t]   (6)

    ψ_l(t) = ∫₀ᵗ ω_l(σ) dσ + Φ[ω_l(t); t] + φ_l .   (7)

To use this model, the amplitudes, frequencies, and phases of the component sine waves must be extracted from the original speech waveform.

Estimation of Speech Parameters

The key problem in speech analysis/synthesis is to extract from a speech waveform the parameters that represent a quasi-stationary portion of that waveform, and to use those parameters (or coded versions of them) to reconstruct an approximation that is "as close as possible" to the original speech. The parameter-extraction algorithm, or estimator, should be robust, as the parameters must often be extracted from a speech signal that has been contaminated with acoustic noise.

In general, it is difficult to determine analytically which of the component sine waves and their amplitudes, frequencies, and phases are necessary to represent a speech waveform. Therefore, an estimator based on idealized speech waveforms was developed to extract these parameters. As restrictions on the speech waveform were relaxed in order to model real speech better, adjustments were made to the estimator to accommodate these changes.

In the development of the estimator, the time axis was first broken down into an overlapping sequence of frames, each of duration T. The center of the analysis window for the k-th frame occurs at time t_k. Assuming that the vocal-tract and glottal parameters are constant over an interval of time that includes the duration of the analysis window and the duration of the vocal-tract impulse response, then Eq. 7 can be written as

    ψ_l^k(t) = ω_l^k (t − t_k) + θ_l^k   (8)

where the superscript k indicates that the parameters of the model may vary from frame to frame. Using Eq. 8 in Eq. 5, the synthetic speech waveform over frame k can be written as

    s(n) = Σ_{l=1}^{L^k} γ_l^k exp(jnω_l^k),  with γ_l^k = A_l^k exp(jθ_l^k)   (9)

where n is the sample index measured relative to the frame center t_k.
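Over a single frame, the model of Eq. 9 is just a sum of constant-parameter sine waves. The real part of that sum can be sketched in a few lines of Python (an illustrative sketch, not code from the article; the sampling rate and parameter values below are hypothetical):

```python
import numpy as np

def synthesize_frame(amps, freqs, phases, n_samples):
    """Evaluate the frame-level sinusoidal model: a sum of sine waves
    with amplitudes A_l, frequencies w_l (radians/sample), and phases
    theta_l held constant over the frame (the real part of Eq. 9)."""
    n = np.arange(n_samples)
    frame = np.zeros(n_samples)
    for A, w, th in zip(amps, freqs, phases):
        frame += A * np.cos(w * n + th)  # one sine-wave component
    return frame

# A hypothetical two-component frame: 100-Hz and 200-Hz partials
# at an 8-kHz sampling rate, 20-ms frame (160 samples).
fs = 8000.0
frame = synthesize_frame(amps=[1.0, 0.5],
                         freqs=[2 * np.pi * 100 / fs, 2 * np.pi * 200 / fs],
                         phases=[0.0, np.pi / 4],
                         n_samples=160)
```

In the full synthesizer these parameters are not constant; they are interpolated from frame to frame, as described below.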
The sine-wave parameters are chosen to minimize the error between the measured speech y(n) and the synthetic speech s(n) over the frame,

    ε^k = Σ_n |y(n) − s(n)|²
        = Σ_n |y(n)|² − 2 Re Σ_n y(n) s*(n) + Σ_n |s(n)|² .   (10)

Substituting the speech model of Eq. 9 into Eq. 10 leads to the error expression

    ε^k = Σ_n |y(n)|² − 2 Re Σ_{l=1}^{L^k} (γ_l^k)* Σ_n y(n) exp(−jnω_l^k)
          + (N + 1) Σ_{l=1}^{L^k} Σ_{i=1}^{L^k} γ_i^k (γ_l^k)* sinc(ω_i^k − ω_l^k)   (11)

where sinc(x) = sin[(N + 1)x/2] / [(N + 1) sin(x/2)]. The task of the estimator is to identify a set of sine waves that minimizes Eq. 11. Insights into the development of a suitable estimator can be obtained by restricting the class of input signals to the idealization of perfectly voiced speech, i.e., speech that is periodic, hence having component sine waves that are harmonically related. In this case the synthetic speech waveform can be written as

    s(n) = Σ_{l=1}^{L^k} γ_l^k exp(jnlω₀^k)   (12)

where ω₀^k = 2π/T₀^k and where T₀^k is the pitch period, assumed to be constant over the duration of the k-th frame. For the purpose of establishing the structure of the ideal estimator, it is further assumed that the pitch period is known and that the width of the analysis window is a multiple of T₀^k. Under these highly idealized conditions, the sinc(·) function in the last term of Eq. 11 reduces to

    sinc(ω_i^k − ω_l^k) = 1 for i = l, and 0 otherwise,   (13)

so that the error becomes

    ε^k = Σ_n |y(n)|² − 2(N + 1) Re Σ_{l=1}^{L^k} (γ_l^k)* Y(lω₀^k) + (N + 1) Σ_{l=1}^{L^k} |γ_l^k|²   (14)

where

    Y(ω) = [1/(N + 1)] Σ_{n=−N/2}^{N/2} y(n) exp(−jnω)   (15)

is the STFT of the measurement signal. By completing the square in Eq. 14, the error can be written as

    ε^k = Σ_n |y(n)|² + (N + 1) Σ_{l=1}^{L^k} [ |Y(lω₀^k) − γ_l^k|² − |Y(lω₀^k)|² ]   (16)

from which it follows that the optimal estimate for the amplitude and phase is

    γ̂_l^k = Y(lω₀^k)   (17)

which reduces the error to

    ε^k = Σ_n |y(n)|² − (N + 1) Σ_{l=1}^{L^k} |Y(lω₀^k)|² .   (18)

From this calculation it follows, therefore, that the error is minimized by selecting all of the harmonic frequencies in the speech bandwidth Ω (i.e., L^k = Ω/ω₀^k).

Equations 15 and 17 completely specify the structure of the ideal estimator and show that the optimal estimator depends on the speech data through the STFT (Eq. 15). Although these results are equivalent to a Fourier-series representation of a periodic waveform, they lead to an intuitive generalization to the more practical case. This is done by considering the function |Y(ω)|² to be a continuous function of ω. For the idealized voiced-speech case, this function (called a periodogram) will be pulselike in nature, with peaks occurring at all of the pitch harmonics. Therefore, the frequencies of the underlying sine waves correspond to the locations of the peaks of the periodogram, and the estimates of the amplitudes and phases are obtained by evaluating the STFT at the frequencies associated with the peaks of the periodogram. This interpretation permits the extension of the estimator to a more generalized speech waveform, one that is not ideally voiced. This extension becomes evident when the STFT is calculated for the general sinusoidal speech model given in Eq. 9. In this case the STFT is simply given by

    Y(ω) = Σ_{l=1}^{L^k} γ_l^k sinc(ω_l^k − ω) .   (19)

Provided the analysis window is "wide enough" that

    sinc(ω_i^k − ω_l^k) ≈ 0 for i ≠ l,   (20)

then the periodogram can be written as

    |Y(ω)|² = Σ_{l=1}^{L^k} |γ_l^k|² sinc²(ω_l^k − ω)   (21)

and, as before, the locations of the peaks of the periodogram correspond to the underlying sine-wave frequencies. The STFT samples at these frequencies correspond to the complex amplitudes. Therefore, provided Eq. 20 holds, the structure of the ideal estimator applies to a more general class of speech waveforms than perfectly voiced speech. Since, during steady voicing, neighboring frequencies are approximately separated by the pitch frequency, Eq. 20 suggests that the desired resolution can be achieved most of the time by requiring that the analysis window be at least two pitch periods wide.

These properties are based on the assumption that the sinc(·) function is essentially zero outside of the region defined by Eq. 20. In fact, this approximation is not a valid one, because there will be sidelobes outside of this region due to the rectangular window implicit in the definition of the STFT. These sidelobes lead to leakage that compromises the performance of the estimator, a problem that is reduced, but not eliminated, by using the weighted STFT. Letting Ỹ(ω) denote the weighted STFT, i.e.,

    Ỹ(ω) = Σ_{n=−N/2}^{N/2} w(n) y(n) exp(−jnω)   (22)

where w(n) represents the temporal weighting due to the window function, the practical version of the idealized estimator estimates the frequencies of the underlying sine waves as the locations of the peaks of |Ỹ(ω)|. Letting these frequency estimates be denoted by {ω̂_l^k}, the corresponding complex amplitudes are

    γ̂_l^k = Ỹ(ω̂_l^k) .   (23)

Assuming that the component sine waves have been properly resolved, then, in the absence of noise, the estimated amplitude will yield the value of an underlying sine wave, provided the window is scaled so that

    Σ_{n=−N/2}^{N/2} w(n) = 1 .   (24)

Using the Hamming window for the weighted STFT provided a very good sidelobe structure, in that the leakage problem was eliminated; it did so at the expense of broadening the main lobes of the periodogram. In order to accommodate this broadening, the constraint implied by Eq. 20 must be revised to require that the window width be at least 2.5 times the pitch period. This revision maintains the resolution features that were needed to justify the optimality properties of the periodogram processor. Although the window width could be set on the basis of the instantaneous pitch, the analyzer is less sensitive to the performance of the pitch extractor if the window width is set on the basis of the average pitch instead. The pitch computed during strongly voiced frames is averaged using a 0.25-s time constant, and this averaged pitch is used to update, in real time, the width of the analysis window. During frames of unvoiced speech, the window is held fixed at the value obtained on the preceding voiced frame or 20 ms, whichever is smaller.

Once the width for a particular frame has been specified, the Hamming window is computed and normalized according to Eq. 24, and the STFT of the input speech is taken using a 512-point fast Fourier transform (FFT) for 4-kHz bandwidth speech. A typical periodogram for voiced speech, along with the amplitudes and frequencies that are estimated using the above procedure, is plotted in Fig. 3.

In order to apply the sinusoidal model to unvoiced speech, the frequencies corresponding to the periodogram peaks must be close enough to satisfy the requirement imposed by the Karhunen-Loève expansion [5] for noiselike signals. If the window is constrained to be at least 20 ms wide, on average, the corresponding periodogram peaks will be approximately 100 Hz apart, enough to satisfy the constraints of the Karhunen-Loève sinusoidal representation for random noise. A typical periodogram for a frame of unvoiced speech, along with the estimated amplitudes and frequencies, is plotted in Fig. 4.

[Figs. 3 and 4 - typical periodograms for a frame of voiced speech and a frame of unvoiced speech, respectively, with the estimated amplitudes and frequencies marked at the periodogram peaks.]

This representation applies only to one analysis frame; different sets of these parameters are obtained for each frame. The next problem to address, then, is how to associate the amplitudes, frequencies, and phases measured on one frame with those found on a successive frame.

Frame-to-Frame Peak Matching

If the number of periodogram peaks were constant from frame to frame, the peaks could
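A minimal sketch of this matching step, in Python, might pair each peak with its nearest unclaimed neighbor within a matching interval and flag the unmatched peaks as track deaths and births (a hypothetical simplification of the article's frequency-matching algorithm; the matching-interval parameter `delta` is illustrative):

```python
def match_peaks(freqs_prev, freqs_cur, delta=0.05):
    """Nearest-neighbor frame-to-frame peak matching with birth/death.

    A previous-frame frequency that finds no current-frame frequency
    within +/- delta 'dies' (its track fades out); a current-frame
    frequency claimed by no track is 'born'."""
    matches, deaths = [], []
    claimed = set()
    for i, f0 in enumerate(freqs_prev):
        best_d, best_j = None, None
        for j, f1 in enumerate(freqs_cur):
            if j in claimed:
                continue  # each current-frame peak is matched at most once
            d = abs(f1 - f0)
            if d <= delta and (best_d is None or d < best_d):
                best_d, best_j = d, j
        if best_j is None:
            deaths.append(i)           # track dies
        else:
            claimed.add(best_j)
            matches.append((i, best_j))
    births = sorted(set(range(len(freqs_cur))) - claimed)
    return matches, births, deaths
```

For example, with previous peaks at 1.0 and 2.0 (rad/sample) and current peaks at 1.02 and 3.0, the first track continues, the second dies, and a new track is born at 3.0.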
[Fig. 8 diagram - ANALYSIS: speech → window → DFT → peak picking → amplitudes, frequencies, and phases (tan⁻¹(·)); SYNTHESIS: frame-to-frame phase unwrapping and interpolation → sine-wave generator → sum all sine waves → synthetic speech output.]
Fig. 8 - This block diagram of the sinusoidal analysis/synthesis system illustrates the major functions subsumed within the
system. Neither voicing decisions nor residual waveforms are required for speech synthesis.
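The ANALYSIS path of Fig. 8 (window, DFT, peak picking) can be sketched in Python as follows. This is an illustrative reconstruction, assuming the normalized Hamming window of Eq. 24, the 512-point FFT mentioned in the text, and simple local-maximum peak picking; the function name is hypothetical:

```python
import numpy as np

def estimate_sine_waves(y, fft_size=512):
    """Periodogram peak-picking estimator for one analysis frame.

    Windows the frame with a Hamming window normalized so its samples
    sum to one (Eq. 24), takes the DFT, and reads off a frequency,
    amplitude, and phase at each local maximum of the magnitude."""
    w = np.hamming(len(y))
    w /= w.sum()                         # Eq. 24: the window sums to 1
    Y = np.fft.rfft(w * y, n=fft_size)   # zero-padded, one-sided DFT
    mag = np.abs(Y)
    # local maxima of the periodogram magnitude
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    freqs = [2 * np.pi * k / fft_size for k in peaks]   # rad/sample
    amps = [2 * mag[k] for k in peaks]   # x2 for the one-sided spectrum
    phases = [float(np.angle(Y[k])) for k in peaks]
    return freqs, amps, phases
```

Applied to a single unit-amplitude cosine aligned with an FFT bin, the dominant peak recovers the input frequency, an amplitude near 1, and a phase near 0; sidelobe leakage produces additional small peaks, which the frame-to-frame matching stage can discard or let die.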
estimation. The maximum number of peaks used in synthesis was set to a fixed number (~80); if excess peaks were obtained, only the largest peaks were used.

A large speech data base was processed with this system, and the synthetic speech was essentially indistinguishable from the original. A visual examination of the reconstructed passages shows that the waveform structure is essentially preserved. This is illustrated by Fig. 9, which compares the waveforms of the original speech and the reconstructed speech during an unvoiced/voiced speech transition. The comparison suggests that the quasi-stationary conditions imposed on the speech model are met and that the use of the parametric model based on the amplitudes, frequencies, and phases of a set of sine-wave components is justified for both voiced and unvoiced speech.

In another set of experiments it was found that the system was capable of synthesizing a broad class of signals including multispeaker waveforms, music, speech in a music background, and marine biologic signals such as whale sounds. Furthermore, the reconstruction does not break down in the presence of noise. The synthesized speech is perceptually indistinguishable from the original noisy speech, with essentially no modification of the noise characteristics. Illustrations depicting the performance of the system in the face of the above degradations are provided in Ref. 6. More recently a real-time system has been completed using the Analog Devices ADSP2100 16-bit fixed-point signal-processing chips, and performance equal to that of the simulation has been achieved.

Although high-quality analysis/synthesis of speech has been demonstrated using the amplitudes, frequencies, and phases of the peaks
Vocoding

Since the parameters of the sinusoidal speech model are the amplitudes, frequencies, and phases of the underlying sine waves, and since, for a typical low-pitched speaker, there can be as many as 80 sine waves in a 4-kHz speech band, the set of sine waves must first be reduced. The amplitudes, frequencies, and phases of this reduced set of sine waves are then coded. Since
the neighboring sine-wave amplitudes are naturally correlated, especially for low-pitched speakers, pulse-code modulation is used to encode the differential log amplitudes.

In the earlier versions of the vocoder all the sine-wave amplitudes were coded simply by allocating the available bits to the number to be coded. Since a low-pitched speaker can produce as many as 80 sine waves, then in the limit of 1 bit/sine-wave amplitude, 4000 bps would be required at a 50-Hz frame rate. For an 8-kbps channel, this leaves 4.0 kbps for coding the pitch, energy, and about 12 baseband phases. However, at 4.8 kbps and below, assigning 1 bit/amplitude immediately exhausts the coding budget, so no phases can be coded. Therefore, a more efficient amplitude encoder had to be developed for operation at these lower rates.

The increased efficiency is obtained by allowing the channel separation to increase logarithmically with frequency, thereby exploiting the critical-band properties of the ear. Rather than implement a set of bandpass filters to obtain the channel amplitudes, as is done in the channel vocoder, an envelope of the sine-wave amplitudes is constructed by linearly interpolating between sine-wave peaks and sampling at the critical-band frequencies. Moreover, the location of the critical-band frequencies was made pitch-adaptive, whereby the baseband samples are linearly separated by the pitch frequency, with the log-spacing applied to the high-frequency channels as required.

In order to preserve naturalness at rates of 4.8 kbps and below, a synthetic phase model was employed that phase-locks all of the sine waves to the fundamental and adds a pitch-dependent quadratic phase dispersion and a voicing-dependent random phase to each sine wave [8]. Using this technique at 4.8 kbps, the synthesized speech achieved a diagnostic rhyme test (DRT) score of 94 (for three male speakers).

Transformations

The goal of time-scale modification is to maintain the perceptual quality of the original speech while changing the apparent rate of articulation. This implies that the frequency trajectories of the excitation (and thus the pitch contour) are stretched or compressed in time and that the vocal tract changes at the modified rate. To achieve these rate changes, the system amplitudes and phases, and the excitation amplitudes and frequencies, along each frequency track are time-scaled. Since the parameter estimates of the unmodified synthesis are available as continuous functions of time, in theory any rate change is possible. Rate changes ranging between a compression of two and an expansion of two have been implemented with good results. Furthermore, the natural quality and smoothness of the original speech were preserved through transitions such as voiced/unvoiced boundaries. Besides the above constant rate changes, linearly varying and oscillatory rate changes have been applied to synthetic speech, resulting in natural-sounding speech that is free of artifacts [9].

Since the synthesis procedure consists of summing the sinusoidal waveforms for each of the measured frequencies, the procedure is ideally suited for performing various frequency transformations. The procedure has been employed to warp the short-time spectral envelope and pitch contour of the speech waveform and, conversely, to alter the pitch while preserving the short-time spectral envelope. These speech transformations can be applied simultaneously, so that time- and frequency (or pitch)-scaling can occur together by simultaneously stretching and shifting frequency tracks. These joint operations can be performed with a continuously adjustable rate change.
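The time-scaling of the amplitude and frequency tracks can be sketched as follows. This is an illustrative simplification (a full implementation would also re-derive the phases through the unwrapping and interpolation step described earlier), and the function and parameter names are hypothetical:

```python
import numpy as np

def stretch_tracks(amps, freqs, rate):
    """Time-scale one sine-wave track by resampling its amplitude and
    frequency functions on a stretched time axis.

    rate = 0.5 doubles the duration (slower articulation); rate = 2.0
    halves it. The frequency VALUES, and hence the pitch contour, are
    unchanged; only their time axis is stretched or compressed."""
    n_in = len(amps)
    n_out = int(round(n_in / rate))
    t_out = np.linspace(0, n_in - 1, n_out)   # stretched sampling instants
    amps_out = np.interp(t_out, np.arange(n_in), amps)
    freqs_out = np.interp(t_out, np.arange(n_in), freqs)
    return amps_out, freqs_out
```

Because the tracks are continuous functions of time, `rate` itself can be made a function of time, which is how the linearly varying and oscillatory rate changes mentioned above can be realized.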
A real-time, 4800-bps system that uses two ADSP2100 signal-processing chips has been successfully implemented. The technology developed from this research has been transferred to the commercial sector (CYLINK, INC), where a product is being produced that will be available in 1989.

Audio Preprocessing

The problem of preprocessing speech that is to be degraded by natural or man-made disturbances arises in applications such as attempts to increase the coverage area of AM radio broadcasts and in improving ground-to-air communications in a noisy cockpit environment. The transmitter in such cases is usually constrained by a peak operating power, or the dynamic range of the system is limited by the sensitivity characteristics of the receiver or by ambient-noise levels. Under these constraints, phase dispersion and dynamic-range compression, along with spectral shaping (pre-emphasis), are combined to reduce the peak-to-rms ratio of the transmitted signal. In this way, more average power can be broadcast without changing the peak-power output of the transmitter, thus increasing loudness and intelligibility at the receiver while adhering to peak-output power constraints.

This problem is similar to one in radar in which the signal is periodic and given as the output of a transmit filter the input of which consists of periodic pulses. The spectral magnitude of the filter is specified, and the phase of the filter is chosen so that, over one period, the signal is frequency-modulated (a linear frequency modulation is usually employed) and has a flat envelope over some specified duration. If there is a peak-power limitation on the transmitter, this approach imparts maximal energy to the waveform. In the case of speech, the voiced-speech signal is approximately periodic and can be modeled as the output of a vocal-tract filter the input of which is a pulse train. The important difference between the radar design problem and the speech-audio preprocessing problem is that, in the case of speech, there is no control over the magnitude and phase of the vocal-tract filter. The vocal-tract filter is characterized by a spectral magnitude and some natural phase dispersion. Thus, in order to take advantage of the radar signal-design solutions to dispersion, the natural phase dispersion must first be estimated and removed from the speech signal. The desired phase can then be introduced. All three of these operations, the estimation of the natural phase, the removal of this phase, and its replacement with the desired phase, have been implemented using the STS.

The dispersive phase introduced into the speech waveform is derived from the measured vocal-tract spectral envelope and a pitch-period estimate that changes as a function of time; the dispersive solution thus adapts to the time-varying characteristics of the speech waveform. This phase solution is sampled at the sine-wave frequencies and linearly interpolated across frames to form the system phase component of each sine wave [10].

This approach lends itself to coupling phase dispersion, dynamic-range compression, and pre-emphasis via the STS. An example of the application of the combined operations on a speech waveform is shown in Fig. 11. A system using these techniques produced an 8-dB reduction in peak-to-rms level, about 3 dB better than commercial processors. This work is currently being extended for further reduction of the peak-to-rms level and for further improvement of quality and robustness. Overall system performance will be evaluated by field tests conducted by Voice of America over representative transmitter and receiver links.

Conclusions

A sinusoidal representation for the speech waveform has been developed; it extracts the amplitudes, frequencies, and phases of the component sine waves from the short-time Fourier transform. In order to account for spurious effects due to sidelobe interaction and time-varying voicing and vocal-tract events, sine waves are allowed to come and go in accordance with a birth/death frequency-tracking algorithm. Once contiguous frequencies are matched, a maximally smooth phase-interpolation function is obtained that is consistent with all of the frequency and phase measurements. This phase function is applied to a sine-wave generator, which is amplitude-modulated and added to the other sine waves to form the output speech. It is important to note that, except in updating the average pitch (used to adjust the width of the analysis window), no voicing decisions are used in the analysis/synthesis procedure.

In some respects the basic model has similarities to one that Flanagan has proposed [11, 12]. Flanagan argues that because of the nature of
Fig. 11 - Audio preprocessing using phase dispersion and dynamic-range compression illustrates the reduction in the peak-to-rms level.
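The quantity that Fig. 11 illustrates can be computed directly, and the benefit of phase dispersion can be demonstrated on a synthetic harmonic waveform, a hypothetical stand-in for voiced speech: replacing zero phases with a quadratic (Schroeder-style) phase curve lowers the peak-to-rms ratio without changing the magnitude spectrum. This is an illustrative sketch, not the article's implementation:

```python
import numpy as np

def peak_to_rms_db(x):
    """Peak-to-rms ratio in dB, the quantity the preprocessing stages
    (phase dispersion, dynamic-range compression, pre-emphasis) are
    designed to reduce."""
    return 20 * np.log10(np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))

# Hypothetical 'voiced' waveform: 32 equal-amplitude harmonics over one
# 256-sample pitch period.
n = np.arange(256)
K = 32
# Zero phases: all harmonics align at n = 0, giving an impulse-like peak.
zero_phase = sum(np.cos(2 * np.pi * k * n / 256) for k in range(1, K + 1))
# Quadratic phase dispersion on the same harmonics spreads the energy
# in time and flattens the envelope.
dispersed = sum(np.cos(2 * np.pi * k * n / 256 + np.pi * k * k / K)
                for k in range(1, K + 1))
```

For the zero-phase case the peak-to-rms ratio is exactly 20 log10(K / sqrt(K/2)) ≈ 18.1 dB for K = 32; the quadratic-phase version is several dB lower, which is the same mechanism the STS exploits when it replaces the natural vocal-tract phase with a designed dispersive phase.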
about one DRT point is lost relative to the unprocessed speech of the same bandwidth with the analysis/synthesis system operating at a 50-Hz frame rate. The system is used in research aimed at the development of a multirate speech coder [14]. A practical low-rate coder has been developed at 4800 bps and 2400 bps using two commercially available DSP chips. Furthermore, the resulting technology has been transferred to private industry for commercial development. The sinusoidal analysis/synthesis system has also been applied successfully to problems in time-scale, pitch-scale, and frequency modification of speech [9] and to the problem of speech enhancement for AM radio broadcasting [10].

Acknowledgments

The authors would like to thank their colleague, Joseph Tierney, for his comments and suggestions in the early stages of this work. The authors would also like to acknowledge the support and encouragement of the late Anton Segota of RADC/EEV, who was the Air Force program manager for this research from its inception in September 1983 until his death in March 1987.

This work was sponsored by the Department of the Air Force.