2. APPLICATIONS
2.1. Broadcast news diarization
The proposed diarization system was inspired by the system described in [1],
which won the RT04 fall evaluation and the ESTER 1 evaluation.
It was developed during the ESTER 2 evaluation campaign for the
transcription and diarization tasks, with the goal of minimizing both
word error rate and speaker error rate.
Automatic transcription requires accurate segment boundaries.
Non-speech segments must be rejected in order to minimize word insertions and to save computation time, and segment
boundaries have to be set within non-informative zones such as filler
words: a word cut by a boundary disturbs the language model and increases the word error rate.
Speaker diarization needs to produce homogeneous speech segments; here, however, the purity and coverage of the speaker clusters are the
main objectives. Errors such as splitting one real speaker into two distinct clusters (i.e.,
detected speakers), or conversely merging the segments of two real speakers into a single cluster, are penalized more heavily in the NIST time-based evaluation metric
than misplaced boundaries [7].
The system is composed of acoustic BIC segmentation followed
by BIC hierarchical clustering. Viterbi decoding is then performed to
adjust the segment boundaries. Music and jingle regions are removed using GMMs, and long segments are cut to be shorter than 20
seconds. A hierarchical clustering using GMMs for speaker models
is computed over the clusters generated by the Viterbi decoding.
\Delta BIC_{i,j} = \frac{n_i + n_j}{2} \log|\Sigma| - \frac{n_i}{2} \log|\Sigma_i| - \frac{n_j}{2} \log|\Sigma_j| - \lambda P  (1)

P = \frac{1}{2} \left( d + \frac{d(d+1)}{2} \right) \log(n_i + n_j)  (2)

where n_i and n_j are the lengths of clusters i and j, \Sigma, \Sigma_i and \Sigma_j are the covariance matrices of the merged cluster and of clusters i and j, d is the dimension of the features, and \lambda weights the penalty P.
This penalty factor only takes the length of the two candidate
clusters into account whereas the standard factor uses the length of
the whole data as defined in [8]. Experiments in [1, 9] show that
better results are obtained by basing P on the length of the two candidate clusters only. However, both penalty factors are implemented
in the toolkit.
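As a minimal sketch (not the toolkit's actual API), the two penalty variants can be computed as follows; the class and method names are hypothetical, and only the formula of Eq. (2) is taken from the text:

```java
// Sketch of the BIC penalty of Eq. (2). The local variant depends only on
// the lengths of the two candidate clusters; the standard variant from [8]
// uses the length of the whole data set instead.
public class BicPenalty {
    // Number of free parameters of a d-dimensional full-covariance Gaussian:
    // d means + d(d+1)/2 covariance terms.
    static double nbParams(int d) {
        return d + d * (d + 1) / 2.0;
    }

    // Local penalty: P = 0.5 * (d + d(d+1)/2) * log(ni + nj)
    static double localPenalty(int d, int ni, int nj) {
        return 0.5 * nbParams(d) * Math.log(ni + nj);
    }

    // Standard penalty [8]: same form, but with the total data length N
    static double globalPenalty(int d, long totalLength) {
        return 0.5 * nbParams(d) * Math.log(totalLength);
    }

    public static void main(String[] args) {
        // Example: 12-dimensional features, candidate clusters of 300 and 500 frames
        System.out.println(localPenalty(12, 300, 500));
    }
}
```

Because log(ni + nj) is smaller than log(N) for short clusters, the local penalty lets short candidate pairs merge more easily than the standard one.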
Segmentation based on Viterbi decoding
A Viterbi decoding is performed to generate a new segmentation.
A cluster is modeled by a HMM with only one state, represented
by a GMM with 8 components (diagonal covariance). The GMM is
learned by EM-ML over the segments of the cluster. The log-penalty
between two HMMs is fixed experimentally.
The segment boundaries produced by the Viterbi decoding are
not perfect: for example, some of them fall within words. In order
2.1.2. Evaluation
The methods were trained and developed on data from the
ESTER 1 (2003-2005) and ESTER 2 (2007-2008) evaluation campaigns. The results of the 2008 campaign are reported in [15]. The
data were recorded in French from 7 radio stations, and are divided
into 3 corpora: the training corpus corresponding to more than 200
hours, the development corpus corresponding to 20 shows, and the
evaluation corpus containing 26 shows.
During the evaluation, the LIUM obtained the best result:
10.80 % of Diarization Error Rate (DER) using the system described
in section 2.1, completed by a post-filtering step. The post-filtering
consists of removing segments longer than 4 seconds that contain silence, breath, or inspiration, as well as music fillers. Fillers are
extracted from the first pass of the speech recognition process [2].
The result was 11.03 % of DER without post-filtering.
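The post-filtering rule described above can be sketched as a simple predicate over labeled segments; the types, labels, and 4-second threshold interpretation below are illustrative assumptions, not the toolkit's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the post-filtering step: drop filler segments (silence, breath,
// inspiration, music) lasting more than 4 seconds, keep everything else.
public class PostFilter {
    // Hypothetical labeled segment: a class label plus a duration in seconds.
    static class Labeled {
        final String label;
        final double durationSec;
        Labeled(String label, double durationSec) {
            this.label = label;
            this.durationSec = durationSec;
        }
    }

    // Labels treated as fillers (an assumption based on the description above).
    static final Set<String> FILLERS = Set.of("silence", "breath", "inspiration", "music");

    static List<Labeled> filter(List<Labeled> segments) {
        List<Labeled> kept = new ArrayList<>();
        for (Labeled s : segments) {
            // Remove only long filler segments; short fillers and speech survive.
            if (!(FILLERS.contains(s.label) && s.durationSec > 4.0)) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```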
Table 1 shows the current result on the test corpus of ESTER 2:
a DER of 10.01 % with no post-filtering. The gain of one point of
DER results from improvements, made after the campaign, in the
handling of consecutive identical features and of aberrant likelihoods
in the Viterbi decoder.
Speech detection
This detection relies on a Viterbi decoding using 3 GMMs corresponding to speech, silence, and dial tone. Each GMM is composed
of 16 Gaussians with diagonal covariance, and is trained via the EM-ML algorithm over 20 recordings (1 hour of data) that were segmented automatically and checked by a human.
Only the speech segments are kept for the next step.
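The decoding step above is a standard log-domain Viterbi over the three classes. The sketch below stubs out the GMM scoring with precomputed per-frame log-likelihoods and uses a single inter-class switch penalty; both are simplifying assumptions, not the toolkit's implementation:

```java
// Minimal log-domain Viterbi over a small set of classes (here: speech,
// silence, dial tone). logLik[t][s] is the log-likelihood of frame t under
// class s, assumed precomputed by the per-class GMMs.
public class SpeechViterbi {
    static int[] decode(double[][] logLik, double switchPenalty) {
        int T = logLik.length, S = logLik[0].length;
        double[][] score = new double[T][S];
        int[][] back = new int[T][S];
        score[0] = logLik[0].clone();
        for (int t = 1; t < T; t++) {
            for (int s = 0; s < S; s++) {
                int best = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < S; p++) {
                    // Staying in the same class is free; switching pays a penalty.
                    double v = score[t - 1][p] + (p == s ? 0.0 : -switchPenalty);
                    if (v > bestScore) { bestScore = v; best = p; }
                }
                score[t][s] = bestScore + logLik[t][s];
                back[t][s] = best;
            }
        }
        // Backtrack from the best final state to recover the class sequence.
        int[] path = new int[T];
        int last = 0;
        for (int s = 1; s < S; s++) if (score[T - 1][s] > score[T - 1][last]) last = s;
        path[T - 1] = last;
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}
```

The switch penalty discourages spurious one-frame class changes, so the resulting segmentation is smoother than frame-wise maximum-likelihood labeling.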
Corpus          %Miss.   %F.A.   %Sub.    DER
africa 1        0.5 %    1.2 %   8.2 %     9.85 %
inter           1.7 %    1.8 %   7.9 %    11.42 %
rfi             0.0 %    0.7 %   2.9 %     3.72 %
tvme            0.5 %    2.7 %   9.7 %    12.89 %
ESTER 2 Total   1.0 %    1.6 %   7.4 %    10.01 %
%Miss.   %F.A.   %Sub.    DER
0.2 %    2.5 %   5.0 %    7.65 %
0.2 %    2.5 %   3.7 %    6.36 %
Speaker detection
The speaker detection method is based on the E-HMM approach described in [5].
The conversation model is based on an HMM with 2 states. Each
state characterizes a cluster, i.e. speaker, and the transitions model
the changes between speakers.
A UBM with 16 diagonal components is trained over 4 hours of
speech automatically extracted by the speech detector.
A first speaker model is trained on all the speech data. The diarization is modeled by a one-state HMM and the speech segments
are labeled as speaker S0 . A second speaker model is then trained
using the three seconds of speech that maximize the likelihood ratio
between model S0 and the UBM. A corresponding state, labeled as
S1 , is added to the previous HMM.
The two speaker models are adapted according to the current
diarization. Then, Viterbi decoding produces a new diarization. The
adaptation and decoding steps are repeated as long as the diarization
differs between two successive iterations.
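The alternating adapt/decode loop above can be illustrated with a drastically simplified analogue, in which the two speaker models are reduced to scalar means and Viterbi decoding to frame-wise nearest-model labeling; everything here is an illustrative assumption except the stopping rule, which mirrors the one described in the text:

```java
import java.util.Arrays;

// Simplified analogue of the E-HMM loop for two speakers S0 and S1:
// adaptation re-estimates each "model" (a 1-D mean) from its current frames,
// decoding relabels each frame with the closer model, and the loop stops
// when the diarization no longer changes between two successive iterations.
public class EhmLoop {
    static int[] diarize(double[] frames, int[] initLabels) {
        int[] labels = initLabels.clone();
        while (true) {
            // Adaptation step: re-estimate each speaker model from its frames.
            double[] sum = new double[2], cnt = new double[2];
            for (int t = 0; t < frames.length; t++) {
                sum[labels[t]] += frames[t];
                cnt[labels[t]]++;
            }
            double m0 = sum[0] / Math.max(cnt[0], 1);
            double m1 = sum[1] / Math.max(cnt[1], 1);
            // Decoding step: relabel each frame with the closer model.
            int[] next = new int[frames.length];
            for (int t = 0; t < frames.length; t++) {
                next[t] = Math.abs(frames[t] - m0) <= Math.abs(frames[t] - m1) ? 0 : 1;
            }
            if (Arrays.equals(next, labels)) return labels; // diarization unchanged: stop
            labels = next;
        }
    }
}
```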
2.2.2. Evaluation
The development corpus is composed of 9 dialogues (33 minutes)
whereas the test corpus is composed of 8 dialogues (23 minutes).
In the MEDIA corpus, each part of the conversation was recorded
separately. In the evaluated corpus, both parts of the conversation
were mixed, and provided as a single channel, 8 kHz signal.
The manual transcriptions of the dialogue are available. The
boundaries of the segments were manually checked and corrected to
be accurate enough for diarization evaluation.
Table 3 shows the DER obtained during development and evaluation. The system obtains 2.47 % of DER over the test corpus. This
error rate is very low, but the task is quite easy: there is a lot of silence
between speaker utterances and very little speech overlap.
Corpus   %Miss.   %F.A.   %Sub.   DER
Dev      2.0 %    1.3 %   1.2 %   4.51 %
Test     1.6 %    0.6 %   0.3 %   2.47 %
of speakers. However, this development affects other classes, so developers are encouraged to minimize the connections between experimental packages and stable packages.
4. PROGRAMMING WITH THE TOOLKIT
4.1. Containers
There are three main containers: one for show annotation, one for
acoustic vectors, and one for models.
Cluster and segment containers
Segment and Cluster objects allow managing a diarization. A segment is an object defined by the name of the recording, the start
time in the recording, and the duration. These three values identify
a unique region of signal in a set of audio recordings.
A cluster is a container of segments, where segments are sorted
first by name of recording, then by start time. A cluster should contain segments of the same nature: speakers, genders, background
environments, etc.
Clusters are stored in an associative container, named ClusterSet, where the key, corresponding to the name of the cluster,
is mapped to the cluster. Therefore, each cluster name should be
unique in a ClusterSet. Figure 3 shows the standard way of iterating
over clusters and segments. This class design eases data processing
when methods iterate over each segment of each cluster. This
access pattern covers the needs of most hierarchical classification algorithms.
No constraint exists for the relationship of a given acoustic feature to a segment. An acoustic feature can belong to any one segment or several segments. No mechanism of checking is provided,
and such checking remains the responsibility of the programmer.
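A minimal self-contained sketch of the described containers follows; these tiny classes mirror the design in the text (segments sorted by recording name then start time, clusters in an associative map keyed by cluster name) but are not the toolkit's actual classes:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// A segment identifies a unique region of signal: recording name,
// start time (in feature indices here), and length.
class Segment implements Comparable<Segment> {
    final String showName;
    final int start, length;
    Segment(String showName, int start, int length) {
        this.showName = showName;
        this.start = start;
        this.length = length;
    }
    // Sort by recording name first, then by start time, as described above.
    public int compareTo(Segment o) {
        int c = showName.compareTo(o.showName);
        return c != 0 ? c : Integer.compare(start, o.start);
    }
}

// A cluster is a sorted container of segments of the same nature
// (speaker, gender, background environment, ...).
class Cluster {
    final String name;
    final TreeSet<Segment> segments = new TreeSet<>(); // kept sorted automatically
    Cluster(String name) { this.name = name; }
}

// Clusters are stored in an associative container keyed by cluster name,
// so each name is unique within a ClusterSet.
class ClusterSet {
    final Map<String, Cluster> clusterMap = new TreeMap<>();
    Cluster get(String name) {
        return clusterMap.computeIfAbsent(name, Cluster::new);
    }
}
```

Iterating over each segment of each cluster is then an outer loop over `clusterMap.values()` and an inner loop over `segments`, the pattern the text attributes to Figure 3.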
[Class diagram: a ClusterSet holds a clusterMap of Cluster objects; each Cluster holds a name and a segmentSet of Segment objects; each Segment holds a showName, startIndex, length, and owner.]
[Class diagram: the model hierarchy rooted in Gaussian and GMM; FullGaussian and DiagGaussian each embed an inner Accumulator.]
The root class Model stores scores: the likelihood of the current feature, plus the sum and the mean of the likelihoods accumulated over
several invocations of the likelihood computation method. Accumulators should be initialized before use and reset by setting them
to zero. Together with the reset method, these storages make it easy to compute the likelihood of a feature, of n features, or of whole segments.
Full and diagonal Gaussians implement an inner class to accumulate the statistics needed during an EM or MAP training. Those
accumulators allow adding or removing weighted acoustic features
to/from the accumulator before generating a new model. An illustration can be found in section 4.2.
4.2. Algorithms
Training
During the training of a model, the features indexed in the cluster
are scanned and EM statistics are stored in the accumulators of the
model. After the accumulation, we estimate the model by EM or
MAP using the accumulators. The same algorithm is employed for
training a mono-Gaussian model and for a GMM. Figure 5 illustrates
an iteration of the EM algorithm.
// Initialize the statistic and score accumulators
gmm.initStatisticAccumulator();
gmm.initScoreAccumulator();
for (Segment segment : cluster) {
    featureSet.setCurrentShow(segment.getShowName());
    // Iterate over the features of the segment
    for (int i = 0; i < segment.getLength(); i++) {
        float[] feature = featureSet.getFeature(segment.getStart() + i);
        // Get the likelihood of the feature (1)
        double lhGMM = gmm.getAndAccumulateLikelihood(feature);
        for (int j = 0; j < gmm.getNbOfComponents(); j++) {
            // Read the likelihood of component j computed in (1)
            double lhGaussian = gmm.getComponent(j).getLikelihood();
            // Add the weighted feature to the accumulator of the GMM
            gmm.getComponent(j).addFeature(feature, lhGaussian / lhGMM);
        }
    }
}
// Compute the model according to the statistics
gmm.setModelFromAccululator();
// Reset the statistic and score accumulators
gmm.resetStatisticAccumulator();
gmm.resetScoreAccumulator();
[3] J. M. Pardo, X. Anguera, and C. Wooters, "Speaker diarization for multiple-distant-microphone meetings using several sources of information," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1212-1224, 2007.
[4] Trung Hieu Nguyen, Eng Siong Chng, and Haizhou Li, "T-test distance and clustering criterion for speaker diarization," in Interspeech 2008, September 2008.
[5] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, and L. Besacier, "Step-by-step and integrated approaches in broadcast news speaker diarization," Computer Speech and Language, vol. 20, no. 2-3, pp. 303-330, 2006.
[6] M. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic segmentation, classification and clustering of broadcast news audio," in the DARPA Speech Recognition Workshop, Chantilly, VA, USA, February 1997.
[7] NIST, "Fall 2004 rich transcription (RT-04F) evaluation plan," August 2004.
[8] S. Chen and P. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, VA, USA, February 1998.
[9] S. E. Tranter and D. A. Reynolds, "Speaker diarisation for broadcast news," in Odyssey 2004 Speaker and Language Recognition Workshop, 2004.
[10] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in 2001: A Speaker Odyssey. The Speaker Recognition Workshop (ISCA, Odyssey 2001), Chania, Crete, June 2001, pp. 213-218.
[11] D. A. Reynolds, E. Singer, B. A. Carlson, G. C. O'Leary, J. J. McLaughlin, and M. A. Zissman, "Blind clustering of speech utterances based on speaker and language characteristics," in Proceedings of the International Conference on Spoken Language Processing (ICSLP 98), 1998.
[12] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish, "Clustering speakers by their voices," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (IEEE, ICASSP 98), Seattle, WA, USA, May 1998.
[13] V.-B. Le, O. Mella, and D. Fohr, "Speaker diarization using normalized cross-likelihood ratio," in Interspeech 2007, 2007.
[14] M. Ben, M. Betser, F. Bimbot, and G. Gravier, "Speaker diarization using bottom-up clustering based on a parameter-derived distance between GMMs," in Proceedings of the International Conference on Spoken Language Processing (ISCA, ICSLP 2004), Jeju, Korea, October 2004.
[15] S. Galliano, G. Gravier, and L. Chaubard, "The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts," in Interspeech 2009, September 2009.
[16] L. Devillers, H. Bonneau-Maynard, S. Rosset, P. Paroubek, K. McTait, D. Mostefa, K. Choukri, L. Charnay, C. Bousquet, N. Vigouroux, F. Béchet, L. Romary, J.-Y. Antoine, J. Villaneau, M. Vergnes, and J. Goulian, "The French MEDIA/EVALDA project: the evaluation of the understanding capability of spoken language dialogue systems," in LREC, Language Resources and Evaluation Conference, Lisbon, Portugal, May 2004, pp. 2131-2134.
[17] V. Jousse, C. Jacquin, S. Meignier, Y. Estève, and B. Daille, "Automatic named identification of speakers using diarization and ASR systems," in ICASSP, Taipei, 2009.