to a target vector y by the conversion function which minimizes the mean squared error between the converted source and the target vectors observed in training:

    y = Σ_{i=1}^{I} p(i|x) · (μ_i^y + Σ_i^{yx} (Σ_i^{xx})^{-1} (x − μ_i^x)) ,

where

    p(i|x) = α_i N(x|μ_i^x, Σ_i^{xx}) / Σ_{j=1}^{I} α_j N(x|μ_j^x, Σ_j^{xx})

and

    Σ_i = [ Σ_i^{xx}  Σ_i^{xy} ;  Σ_i^{yx}  Σ_i^{yy} ] ,   μ_i = [ μ_i^x ;  μ_i^y ] .

The target line spectral frequency vectors are transformed to linear predictive coefficients that are used in the framework of linear prediction pitch-synchronous overlap and add [7] to generate the converted speech signal. Here, for each time frame, the respective underlying residual is required. In the following section, we want to deal with techniques that allow for predicting these residuals.
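As an illustration (not part of the original system), the conversion function above can be sketched in a few lines of code. For brevity, the sketch assumes scalar features, so the covariance blocks Σ_i^{xx} and Σ_i^{yx} reduce to scalars; all names are illustrative.

```python
import math

def normal(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def convert(x, alphas, mu_x, mu_y, s_xx, s_yx):
    """GMM-based conversion: posterior-weighted sum of local linear regressions."""
    # Unnormalized component densities alpha_i * N(x | mu_i^x, sigma_i^xx)
    dens = [a * normal(x, mx, sxx) for a, mx, sxx in zip(alphas, mu_x, s_xx)]
    total = sum(dens)
    # Each component contributes mu_i^y + sigma_i^yx / sigma_i^xx * (x - mu_i^x)
    return sum((d / total) * (my + syx / sxx * (x - mx))
               for d, mx, my, sxx, syx in zip(dens, mu_x, mu_y, s_xx, s_yx))
```

With a single mixture component, this reduces to a plain linear regression from the source feature to the target feature.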
3. SEVEN TIMES RESIDUAL PREDICTION

3.1. The Trivial Solution: Copying Source Residuals

Let us suppose that the vocal tract characteristics of a speaker's speech are represented by line spectral frequencies, and the residuals correspond to the excitation signal that hardly contains speaker-dependent information. Then, the simplest idea for the residual prediction is to take the residuals of the source speech and filter them by means of the converted features. This technique was used by A. Kain and M. W. Macon, but they stated that "merely changing the spectrum is not sufficient for changing the speaker identity" [4]. Most of the listeners "had the impression that a 'third' speaker was created".

3.2. The Cheat: Copying Reference Residuals

Hence, it seems that the residuals contain a lot of speaker-dependent information, so that the contribution of the spectral transformation to the properties of the converted signal is limited.

3.3. Residual Codebook Method

In addition to the observation that the residual signal also contains speaker-dependent information, it has to be mentioned that the line spectral frequencies which describe the vocal tract characteristics and the corresponding residuals are even correlated. This insight led to the idea that the residuals of the converted speech could be predicted based on the converted feature vectors and resulted in the following residual prediction technique [9].

In the training phase, for each pitch-synchronous frame, we compute the linear predictive coefficients and convert them to a cepstral representation. Then, the probability distribution of the set of all linear predictive cepstral vectors seen in training is modeled by means of a Gaussian mixture model with I_rc mixture components. Now, we determine the typical residual magnitude spectrum m̂_i for each mixture component i by computing a weighted average of all residual magnitude spectra m_n seen in training, where the weights are the posterior probabilities p(i|v_n) that a given cepstral vector v_n belongs to the mixture component i:

    m̂_i = ( Σ_{n=1}^{N} m_n p(i|v_n) ) / ( Σ_{ν=1}^{N} p(i|v_ν) ) .

During the conversion phase, for each frame, we obtain a converted cepstral vector ṽ that serves as the basis for the prediction of the residual magnitude spectrum m̃ by calculating a weighted sum over all mixture components:

    m̃ = Σ_{i=1}^{I_rc} m̂_i p(i|ṽ) .
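The training and conversion steps of the codebook method might be sketched as follows. This is an illustration, not the authors' implementation: the mixture components are assumed to have diagonal covariances, and all names are hypothetical.

```python
import math

def gauss_diag(x, mean, var):
    """Diagonal-covariance Gaussian density N(x | mean, var)."""
    p = 1.0
    for xd, md, vd in zip(x, mean, var):
        p *= math.exp(-0.5 * (xd - md) ** 2 / vd) / math.sqrt(2 * math.pi * vd)
    return p

def posteriors(v, weights, means, vars_):
    """Posterior probabilities p(i|v) for each mixture component i."""
    dens = [a * gauss_diag(v, m, s) for a, m, s in zip(weights, means, vars_)]
    total = sum(dens)
    return [d / total for d in dens]

def train_codebook(residual_spectra, cepstra, weights, means, vars_):
    """m_hat_i: posterior-weighted average of training residual magnitude spectra."""
    I, D = len(weights), len(residual_spectra[0])
    num = [[0.0] * D for _ in range(I)]
    den = [0.0] * I
    for m_n, v_n in zip(residual_spectra, cepstra):
        post = posteriors(v_n, weights, means, vars_)
        for i, p in enumerate(post):
            den[i] += p
            for d in range(D):
                num[i][d] += p * m_n[d]
    return [[num[i][d] / den[i] for d in range(D)] for i in range(I)]

def predict_residual(v_conv, codebook, weights, means, vars_):
    """m_tilde = sum_i m_hat_i * p(i | v_conv)."""
    post = posteriors(v_conv, weights, means, vars_)
    D = len(codebook[0])
    return [sum(post[i] * codebook[i][d] for i in range(len(codebook)))
            for d in range(D)]
```

Because the denominator of m̂_i normalizes over all posteriors, each codebook entry is a convex combination of the training spectra assigned to that component.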
3.4. Spectral Refinement

When applying the spectral refinement technique [10], we first train a Gaussian mixture model with I_sr mixture components on the line spectral frequency representation of all target speaker spectra seen in the training corpus. Now, we introduce the vector P(v) = [p(1|v), . . . , p(I_sr|v)]′, which consists of the posterior probabilities that vector v belongs to the components 1, . . . , I_sr, and a matrix M = [M_1, . . . , M_{I_sr}], which consists of I_sr prototypes of logarithmic residual representations. In order to determine M, we minimize the following square error:

    ε = Σ_{n=1}^{N} S(log(m_n) − M P(v_n)) ,

where S(x) is the sum over the squared elements of a vector x, and m_n is the residual magnitude spectrum of the nth speech frame seen in training.

During the conversion phase, for each frame, we are given a transformed feature vector ṽ and predict the corresponding residual magnitude spectrum as

    m̃ = exp(M P(ṽ)) .
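With the posterior vectors P(v_n) assumed precomputed, minimizing ε is an ordinary least-squares problem. The following hedged sketch, which is not the original implementation, solves it via the normal equations with a small Gauss-Jordan elimination; all names are illustrative.

```python
import math

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination (A: I x I, B: I x D)."""
    n = len(A)
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        B[col], B[pivot] = B[pivot], B[col]
        f = A[col][col]
        A[col] = [a / f for a in A[col]]
        B[col] = [b / f for b in B[col]]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[col])]
                B[r] = [b - f * bc for b, bc in zip(B[r], B[col])]
    return B

def train_prototypes(spectra, posts):
    """Least-squares fit of M in log(m_n) ~ M P(v_n); returns M as a D x I matrix."""
    I, D = len(posts[0]), len(spectra[0])
    # Normal equations: (sum_n P P') M' = sum_n P log(m)'
    A = [[sum(p[i] * p[j] for p in posts) for j in range(I)] for i in range(I)]
    B = [[sum(p[i] * math.log(m[d]) for p, m in zip(posts, spectra))
          for d in range(D)] for i in range(I)]
    Mt = solve(A, B)  # I x D; M is its transpose
    return [[Mt[i][d] for i in range(I)] for d in range(D)]

def predict_spectrum(M, post):
    """m_tilde = exp(M P(v))."""
    return [math.exp(sum(M[d][i] * post[i] for i in range(len(post))))
            for d in range(len(M))]
```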
3.5. Residual Selection

The residual codebook method and the spectral refinement described in the previous sections try to represent an arbitrary residual by a linear combination of a limited number of prototype residuals. Since both methods only deal with the residual magnitude spectrum, they have to apply a phase prediction in addition; for details, please refer to the respective publications [9, 10].

To better model the manifold characteristics of the residuals and to predict both magnitude and phase spectra at the same time, the residual selection technique stores all residuals r_n seen in training in a table together with the corresponding feature vectors v_n, which this time are composed of the line spectral frequencies and their deltas [11].

In the conversion phase, we have the current feature vector ṽ of the above-described structure and choose one residual from the table by minimizing the square error between ṽ and all feature vectors seen in training:

    r̃ = r_ñ   with   ñ = arg min_{n=1,...,N} S(ṽ − v_n) .    (1)
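The selection rule of Eq. 1 is a plain nearest-neighbor lookup over the training table; a minimal sketch (illustrative names, not the original code):

```python
def select_residual(v_conv, features, residuals):
    """Return the stored residual whose feature vector minimizes S(v_conv - v_n)."""
    def sq_err(a, b):
        # S(x): sum over the squared elements of the difference vector
        return sum((x - y) ** 2 for x, y in zip(a, b))
    n_best = min(range(len(features)), key=lambda n: sq_err(v_conv, features[n]))
    return residuals[n_best]
```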
3.6. Residual Smoothing

When listening to converted speech generated using the residual selection technique, in particular in voiced regions, we note a lot of artifacts and, sometimes, obviously improper residuals that occur due to the insufficient correlation between feature vectors and residuals. In voiced regions, the signal should be almost periodic; we do not expect the residuals to change abruptly. In unvoiced regions, as mentioned in Section 1, the residuals should feature a random phase spectrum that essentially changes from frame to frame. These considerations led to the idea of a voicing-dependent residual smoothing as proposed in [12].

We are given the sequence r̃_1^K of predicted residual target vectors derived from Eq. 1, a sequence of scalars σ_1^K with 0 < σ_k ≤ 1 that are the voicing degrees of the frames to be converted, determined according to [13], and the voicing gain α. At last, we obtain the final residuals by applying a normal distribution function to compute a weighted average over all residual vectors r̃_1^K; the standard deviation is defined by the product of gain and voicing degree:

    r_k* = ( Σ_{κ=1}^{K} N(κ|k, ασ_k) · r̃_κ ) / ( Σ_{κ=1}^{K} N(κ|k, ασ_k) ) .    (2)

This equation can be interpreted as follows: In the case of voiced frames (σ ≈ 1), we obtain a wide bell curve that averages over several neighboring residuals, whereas for unvoiced frames (σ → 0), the curve approaches a Dirac function, i.e., there is no local smoothing, and the residuals and the corresponding phase spectra change chaotically over time, as expected in unvoiced regions.

In order to be able to execute the summation, the vectors r̃_1^K must have the same lengths. This is achieved by utilizing a normalization as suggested in [10], where all residuals are normalized to the norm fundamental frequency f_0^n, cf. Table 1.
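A sketch of the smoothing step of Eq. 2, assuming the residual vectors have already been length-normalized; since the Gaussian weights are renormalized by the denominator, the constant factor of the density cancels and is omitted. Names are illustrative.

```python
import math

def smooth_residuals(residuals, sigmas, alpha):
    """Voicing-dependent smoothing: Gaussian-weighted average over neighbors (Eq. 2)."""
    K, D = len(residuals), len(residuals[0])
    out = []
    for k in range(K):
        std = alpha * sigmas[k]
        # For std -> 0 the kernel approaches a Dirac: keep only frame k unchanged
        if std < 1e-8:
            out.append(list(residuals[k]))
            continue
        # Unnormalized Gaussian weights N(kappa | k, std)
        w = [math.exp(-0.5 * ((kap - k) / std) ** 2) for kap in range(K)]
        s = sum(w)
        out.append([sum(w[kap] * residuals[kap][d] for kap in range(K)) / s
                    for d in range(D)])
    return out
```

A large α·σ_k yields a wide bell curve that averages many neighboring residuals, while a small one leaves each frame essentially untouched, mirroring the interpretation given above.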
3.7. Unit Selection

Although the residual smoothing approach essentially improves the speech quality (from a mean opinion score of 2.0 to 2.6, cf. Table 4), it is still insufficient for applications where the quality is of importance, such as server-based speech synthesis with f_s ≥ 16 kHz. Mainly, this is due to an oversmoothing caused when the voicing gain α is too large. This results in a deterioration of the articulation and increases the voicing of unvoiced sounds. However, when α is too small, the artifacts are not sufficiently suppressed; hence, the choice of α is based on a compromise.

Recently, we presented the unit selection approach [14], a technique that is well-known from concatenative speech synthesis [15]. Unit selection-based residual prediction is able to more reliably select residuals from the training database. Consequently, the selected residual sequence r̃_1^K already contains fewer artifacts; thus, we can use smaller values for α and, hence, improve the quality of the converted speech.

Generally, in the unit selection framework, two cost functions are defined. In this case, the target cost C^t(r_k, ṽ_k) is an estimate of the difference between the database residual r_k and that selected by means of feature vector ṽ_k. The concatenation cost C^c(r_{k−1}, r_k) is an estimate of the quality of a join between the consecutive residuals r_{k−1} and r_k.

The searched residual sequence r̃_1^K is determined by minimizing the sum of the target and concatenation costs applied to an arbitrarily selected sequence of K elements from the set of residuals seen in training, r_1^M, given the target feature sequence ṽ_1^K:

    r̃_1^K = arg min_{r_1^K} Σ_{k=1}^{K} [ C^t(r_k, ṽ_k) + C^c(r_{k−1}, r_k) ] .
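This minimization can be carried out exactly with dynamic programming, as in a Viterbi-style unit selection search. The following is an illustrative sketch with placeholder cost functions, not the implementation of [14]:

```python
def unit_select(database, targets, c_t, c_c):
    """Return the residual sequence minimizing summed target and join costs."""
    K, M = len(targets), len(database)
    INF = float("inf")
    cost = [[INF] * M for _ in range(K)]
    back = [[0] * M for _ in range(K)]
    for m in range(M):
        cost[0][m] = c_t(database[m], targets[0])
    for k in range(1, K):
        for m in range(M):
            tc = c_t(database[m], targets[k])
            best, arg = INF, 0
            for p in range(M):
                c = cost[k - 1][p] + c_c(database[p], database[m])
                if c < best:
                    best, arg = c, p
            cost[k][m] = best + tc
            back[k][m] = arg
    # Trace back the cheapest path through the database units
    m = min(range(M), key=lambda i: cost[K - 1][i])
    path = [m]
    for k in range(K - 1, 0, -1):
        m = back[k][m]
        path.append(m)
    path.reverse()
    return [database[i] for i in path]
```

The search is O(K·M²); in practice, concatenative synthesizers prune the candidate set per frame to keep it tractable.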
4. EVALUATION

4.1. Experimental Corpus

The corpus utilized in this work contains 100 Spanish sentences uttered by a female and a male speaker, cf. [16]. It was designed to provide baseline voices for Spanish speech synthesis, e.g. in the UPC text-to-speech system [17]. For the corpus statistics, cf. Table 2.

             training             test
           M        time       K       time
  female  51,365   341.1 s    1,901   14.0 s
  male    44,772   377.2 s    1,703   13.4 s

Table 2. Corpus statistics; M and K are the numbers of frames in the training and test data, respectively.

4.2. Subjective Evaluation

The goal of the subjective evaluation of the described residual prediction techniques is to answer two questions:

• Does the technique change the speaker identity in the intended way?

• How does a listener assess the overall sound quality of the converted speech?

We want to find the answers by means of the extended ABX test and the mean opinion score (MOS) evaluation described in [18].

We performed both female-to-male (f2m) and male-to-female (m2f) voice conversion using the corpus described in Section 4.1. Then, 27 evaluation participants, 25 of whom were specialists in speech processing, were asked whether the converted voice sounds similar to the source voice, to the target voice, or to neither of them (extended ABX test). In doing so, they were asked to ignore the recording conditions, the sound or synthesis quality of the samples, the speaking style, and the prosody. Furthermore, they assessed the overall quality of the converted speech on an MOS scale between 1 (bad) and 5 (excellent).¹

¹The subjective evaluation was carried out using the web interface http://suendermann.com/unit.

Table 3 reports the results of the extended ABX test and Table 4 those of the MOS rating depending on the residual prediction technique and the gender combination.

  [%]                   source   target   neither
  source residuals        20       10       70
  reference residuals      0       79       21
  residual codebook        0       70       30
  spectral refinement      0       83       17
  residual selection       0       70       30
  residual smoothing       0       85       15
  unit selection           2       83       15

Table 3. Results of the extended ABX test

                        m2f    f2m    total
  source residuals      3.2    3.7     3.5
  reference residuals   3.0    3.0     3.0
  residual codebook     1.6    1.9     1.8
  spectral refinement   2.0    2.0     2.0
  residual selection    1.7    2.3     2.0
  residual smoothing    2.2    2.9     2.6
  unit selection        2.8    3.2     3.0

Table 4. Results of the MOS test

5. INTERPRETATION

Addressing the first question that we asked in Section 4.2, we find that all assessed techniques succeed in converting the source voice to the target voice in more than 70% of the cases, cf. Table 3. The only exception is the use of the source residuals, where the majority of the listeners had the impression of hearing a third speaker, as we had already expected in Section 3.1. Spectral refinement, residual smoothing, and unit selection showed almost the same conversion performance: In about 85% of the cases, the target voice was recognized, which is even higher than using the time-aligned reference residuals.

When we have a look at Table 4, we note that using unprocessed residuals produces the highest speech quality. Applying the reference residuals works worse, as the time alignment based on dynamic time warping sometimes results in prosodic artifacts. However, we have to emphasize again that the reference speech is not given in a real-world situation; we only considered this procedure to obtain a standard of comparison.

Having a look at the remaining techniques that succeeded in converting the source to the target voice, we see that the unit selection technique outperforms the others in terms of speech quality and achieves a quality similar to that of using the reference residuals (MOS = 3.0). Past experiments on voice conversion techniques that explicitly are to "introduce as few distortions as possible" [19] while ignoring the success of the speaker identity conversion resulted in MOS
scores about 3.0 [19], also on the same corpus [18]. Consequently, the achieved speech quality is state-of-the-art.

6. CONCLUSION

In this paper, we compared seven residual prediction techniques and applied them to voice conversion. We found that the unit selection-based approach outperforms the others in terms of speech quality. Furthermore, this technique resulted in a voice conversion performance that was one of the best achieved in the evaluation.

[5] Y. Stylianou, O. Cappé, and E. Moulines, "Statistical Methods for Voice Quality Transformation," in Proc. of the Eurospeech'95, Madrid, Spain, 1995.

[6] V. Goncharoff and P. Gries, "An Algorithm for Accurately Marking Pitch Pulses in Speech Signals," in Proc. of the SIP'98, Las Vegas, USA, 1998.

[7] E. Moulines and F. Charpentier, "Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones," Speech Communication, vol. 9, no. 5, 1990.

[10] H. Ye and S. J. Young, "Quality-Enhanced Voice Morphing Using Maximum Likelihood Transformations," to appear in IEEE Trans. on Speech and Audio Processing, 2005.

[11] H. Ye and S. J. Young, "High Quality Voice Morphing," in Proc. of the ICASSP'04, Montreal, Canada, 2004.

[12] D. Sündermann, A. Bonafonte, H. Ney, and H. Höge, "A Study on Residual Prediction Techniques for Voice Conversion," in Proc. of the ICASSP'05, Philadelphia, USA, 2005.

[17] A. Bonafonte, I. Esquerra, A. Febrer, J. A. R. Fonollosa, and F. Vallverdú, "The UPC Text-to-Speech System for Spanish and Catalan," in Proc. of the ICSLP'98, Sydney, Australia, 1998.

[18] D. Sündermann, A. Bonafonte, H. Ney, and H. Höge, "Time Domain Vocal Tract Length Normalization," in Proc. of the ISSPIT'04, Rome, Italy, 2004.

[19] M. Eichner, M. Wolff, and R. Hoffmann, "Voice Characteristics Conversion for TTS Using Reverse VTLN," in Proc. of the ICASSP'04, Montreal, Canada, 2004.