
2006 International Joint Conference on Neural Networks

Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada


July 16-21, 2006

Kernel based Clustering and Vector Quantization for Speech Segmentation

D. Srikrishna Satish and C. Chandra Sekhar

Abstract— In this paper, we propose an approach to segmentation of continuous speech into syllable-like units, where each unit has one or more consonants followed by a vowel. The proposed approach uses clustering and vector quantization methods to identify the consonant, transition and vowel regions in continuous speech. We consider methods based on clustering and vector quantization in the Mercer kernel feature space for separation of nonlinearly separable clusters of data belonging to the different regions. Results of experimental studies demonstrate the effectiveness of the kernel based methods in improving the performance of the speech segmentation system.

I. INTRODUCTION

Continuous speech recognition involves speech signal-to-symbol transformation and symbol-to-text conversion, where symbols correspond to subword units of speech. Syllable-like units are commonly used as symbols [1]. One approach to developing the speech signal-to-symbol transformation module is based on segmentation and labeling. This approach involves detection of the boundaries of subword unit segments in the given continuous speech signal and then using a subword unit classifier to assign a label to each segment. Among the different forms of syllable-like units, the C+V units have a high frequency of occurrence. A C+V unit consists of one or more consecutive consonants (C) followed by a vowel (V). The most common forms of C+V units are CV, CCV and CCCV, in decreasing order of frequency of occurrence. Approaches based on signal processing methods have been developed for detection of syllable boundaries in continuous speech [2]. In this paper, we propose a clustering based approach for detection of the boundaries of C+V units in continuous speech.

The clustering based approach to detection of the boundaries of C+V units involves identification of the consonant regions and vowel regions in the continuous speech signal. The consonant and vowel regions have different gross characteristics of the speech signal. Different categories of consonants, such as stop consonants, nasals, semivowels and fricatives, also differ in their gross characteristics. Production of speech for a CV unit involves a transition from a consonant to a vowel. The characteristics of the signal in the transition region are different from those of the consonant and vowel regions. The characteristics of the transition region also differ across CV units. Therefore, segmentation of a continuous speech utterance involves identification of the consonant regions, transition regions and vowel regions in the speech signal. We propose a clustering based approach to identify these different types of regions. The clustering approach involves forming clusters in the space of speech parametric vectors that represent the spectral characteristics of the speech signal. The clusters are nonlinearly separable for the data of consonants and vowels that have similar spectral characteristics. The commonly used K-means clustering method gives a linear separation of the data and is not suitable for separation of nonlinearly separable data. In this paper, we explore Mercer kernel based methods for clustering of nonlinearly separable data. These methods involve a nonlinear transformation from the input space to a higher dimensional feature space induced by inner-product kernels, or Mercer kernels. The nonlinear mapping by the kernels is expected to lead to linear separability of the data in the kernel feature space. A linear separation formed by clustering in the kernel feature space then corresponds to a nonlinear separation in the input space. We also address issues in performing vector quantization in the kernel feature space. The methods for kernel based clustering and vector quantization are used for segmentation of continuous speech into C+V units.

The paper is organized as follows: The method for clustering in the Mercer kernel feature space is explained in Section II. The method for vector quantization in the kernel feature space is described in Section III. The proposed approach for detection of the boundaries of C+V units in continuous speech is presented in Section IV. Experimental studies on detection of the boundaries of C+V units are presented in Section V.

II. CLUSTERING IN KERNEL FEATURE SPACE

Consider a set of N data points in the input space, x_n, n = 1, 2, ..., N. Let the number of clusters to be formed be K. The criterion used by the K-means clustering method in the input space for grouping the data into K clusters is to minimize the trace of the within-cluster scatter matrix, S_w, defined as follows [3]:

    S_w = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} (x_n - m_k)(x_n - m_k)^T    (1)

where m_k is the center of the k-th cluster, C_k, and z_{kn} is the membership of data point x_n in the cluster C_k. The membership value z_{kn} = 1 if x_n \in C_k, and 0 otherwise. The number of points in the k-th cluster is given as N_k, defined by:

    N_k = \sum_{n=1}^{N} z_{kn}    (2)

D. Srikrishna Satish and C. Chandra Sekhar are with the Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-36, India (email: chandra@cs.iitm.ernet.in).

0-7803-9490-9/06/$20.00 ©2006 IEEE
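The K-means criterion of Eqs. 1 and 2 can be sketched in code. The following is a minimal illustration (not from the paper; the function name, toy data and array layout are our own) that computes the trace of the within-cluster scatter matrix for a given hard assignment:

```python
import numpy as np

def trace_within_cluster_scatter(X, labels, K):
    """Trace of S_w from Eq. 1 for a hard cluster assignment.

    X      : (N, d) array of input-space vectors x_n
    labels : (N,) array with labels[n] = k when z_kn = 1
    K      : number of clusters
    """
    N = X.shape[0]
    total = 0.0
    for k in range(K):
        Ck = X[labels == k]            # points with z_kn = 1
        if len(Ck) == 0:
            continue
        m_k = Ck.mean(axis=0)          # cluster center, as in Eq. 3
        diffs = Ck - m_k
        # trace of a sum of outer products equals the sum of squared norms
        total += np.sum(diffs ** 2)
    return total / N

# Toy data: two tight, well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(trace_within_cluster_scatter(X, labels, K=2))  # -> 0.0025
```

K-means can be viewed as searching over the indicator matrix Z for the assignment that minimizes this quantity; the kernel based method described next applies the same criterion in the feature space.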


The center of the cluster C_k is given as m_k, defined by:

    m_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{kn} x_n    (3)

The optimal clustering of the data points involves determining the K-by-N indicator matrix, Z, with elements z_{kn}, that minimizes the trace of the matrix S_w. K-means clustering in the input space is suitable for separation of linearly separable data. For proper separation of nonlinearly separable data in the input space, we consider K-means clustering in the feature space of a Mercer kernel.

For a Mercer kernel that uses a continuous nonlinear mapping, \Phi, the kernel function is given by:

    K_{nl} = K(x_n, x_l) = \Phi^T(x_n) \Phi(x_l)    (4)

For clustering in the kernel feature space, the criterion for optimal partitioning is to minimize the trace of the within-cluster scatter matrix in the feature space, S_w^\Phi. The feature space scatter matrix is given by:

    S_w^\Phi = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} (\Phi(x_n) - m_k^\Phi)(\Phi(x_n) - m_k^\Phi)^T    (5)

where m_k^\Phi, the center of the k-th cluster in the feature space, is given by:

    m_k^\Phi = \frac{1}{N_k} \sum_{n=1}^{N} z_{kn} \Phi(x_n)    (6)

The trace of the scatter matrix S_w^\Phi is given by:

    Tr(S_w^\Phi) = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} (\Phi(x_n) - m_k^\Phi)^T (\Phi(x_n) - m_k^\Phi)    (7)

The trace of S_w^\Phi can be computed using kernel operations only, as given below [3][4]:

    Tr(S_w^\Phi) = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} \left( \Phi^T(x_n)\Phi(x_n) - 2\Phi^T(x_n) m_k^\Phi + m_k^{\Phi\,T} m_k^\Phi \right)
                 = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} K_{nn} - 2 \sum_{k=1}^{K} \frac{N_k}{N} \cdot \frac{1}{N_k^2} \sum_{n=1}^{N} \sum_{l=1}^{N} z_{kn} z_{kl} K_{nl} + \sum_{k=1}^{K} \gamma_k R(x|C_k)
                 = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} K_{nn} - \sum_{k=1}^{K} \gamma_k R(x|C_k)    (8)

where \gamma_k = N_k / N and

    R(x|C_k) = \frac{1}{N_k^2} \sum_{n=1}^{N} \sum_{l=1}^{N} z_{kn} z_{kl} K_{nl}    (9)

Here the term R(x|C_k) provides a measure of the compactness of the k-th cluster, C_k.

Let us consider the Gaussian kernel defined below:

    K_{nl} = K(x_n, x_l) = \exp\left( -\frac{\|x_n - x_l\|^2}{\sigma^2} \right)    (10)

where \sigma^2 is the kernel width parameter. Noting that K_{nn} = 1, N_k = \sum_{n=1}^{N} z_{kn} and N = \sum_{k=1}^{K} N_k, Eq. 8 is rewritten as

    Tr(S_w^\Phi) = 1 - \sum_{k=1}^{K} \gamma_k R(x|C_k)    (11)

Therefore, the problem of minimizing the trace of S_w^\Phi is the same as the problem of maximizing the compactness of the clusters in the kernel feature space, and it involves finding the indicator matrix such that

    \hat{Z} = \arg\min_Z Tr(S_w^\Phi) = \arg\max_Z \sum_{k=1}^{K} \gamma_k R(x|C_k)    (12)

For formulation of the above problem as an optimization problem, Eq. 8 is rewritten as

    Tr(S_w^\Phi) = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} \left( K_{nn} - \frac{1}{N_k} \sum_{l=1}^{N} z_{kl} K_{nl} \right) = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} D_{kn}    (13)

where

    D_{kn} = K_{nn} - \frac{1}{N_k} \sum_{l=1}^{N} z_{kl} K_{nl}    (14)

The term D_{kn} is the penalty associated with assigning x_n to the k-th cluster in the kernel feature space. The second term in Eq. 14 corresponds to the average similarity of \Phi(x_n) with the vectors in the k-th cluster in the feature space.

It is shown in [3] that the stochastic method for finding the optimal values of the elements of Z leads to the following iterative procedure:

    \langle z_{kn} \rangle^{new} = \frac{\gamma_k \exp(-2\beta \langle D_{kn} \rangle^{new})}{\sum_{k'=1}^{K} \gamma_{k'} \exp(-2\beta \langle D_{k'n} \rangle^{new})}    (15)

and

    \langle D_{kn} \rangle^{new} = K_{nn} - \frac{1}{\langle N_k \rangle} \sum_{l=1}^{N} \langle z_{kl} \rangle K_{nl}    (16)

where

    \gamma_k = \exp(\beta \langle R(x|C_k) \rangle)    (17)

The parameter \beta controls the softness of the assignments of the data points to the clusters during the optimization [5]. The terms \langle z_{kn} \rangle, \langle N_k \rangle and \langle R(x|C_k) \rangle denote the estimates of the expected values of z_{kn}, N_k and R(x|C_k), respectively. The iterative procedure in Eq. 15 is continued until convergence, i.e., until there is no significant change in the values of the elements of the indicator matrix Z.

We consider an example of nonlinearly separable data to demonstrate the limitation of K-means clustering in the input pattern space and the effectiveness of the kernel based clustering method in separating nonlinearly separable data in the input space. The scatter plot of nonlinearly separable data corresponding to two interlocking clusters in a 2-dimensional space is shown in Figure 1(a). In this example, the data of the two classes are separable by an S shaped curve in the input space.
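The iterative procedure of Eqs. 13 to 17 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy data, and details such as the random initialization and the numerical-stability shift are our own assumptions; only the update rules follow the equations above.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma2):
    """K_nl = exp(-||x_n - x_l||^2 / sigma2), as in Eq. 10."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma2)

def kernel_kmeans_soft(X, K, sigma2, beta, n_iter=100, seed=0):
    """Soft kernel clustering sketch following Eqs. 13-17.

    Returns the (K, N) matrix of soft memberships <z_kn>.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Kmat = gaussian_kernel_matrix(X, sigma2)
    Z = rng.random((K, N))
    Z /= Z.sum(axis=0, keepdims=True)            # memberships sum to 1 per point
    for _ in range(n_iter):
        Nk = Z.sum(axis=1)                       # <N_k>
        # Penalty D_kn = K_nn - (1/<N_k>) sum_l <z_kl> K_nl  (Eqs. 14, 16);
        # K_nn = 1 for the Gaussian kernel.
        D = 1.0 - (Z @ Kmat) / Nk[:, None]
        # Compactness R(x|C_k) of Eq. 9 and gamma_k of Eq. 17.
        R = np.einsum('kn,nl,kl->k', Z, Kmat, Z) / (Nk ** 2)
        # Softmax update of <z_kn>, Eq. 15, computed via shifted logits.
        logits = beta * R[:, None] - 2.0 * beta * D
        logits -= logits.max(axis=0, keepdims=True)
        Z_new = np.exp(logits)
        Z_new /= Z_new.sum(axis=0, keepdims=True)
        if np.max(np.abs(Z_new - Z)) < 1e-6:     # convergence of Z
            return Z_new
        Z = Z_new
    return Z

# Toy usage: two tight, well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.1], [4.0, 4.0], [4.1, 4.1]])
Z = kernel_kmeans_soft(X, K=2, sigma2=1.0, beta=5.0)
print(Z.argmax(axis=0))
```

On data such as this toy example, the soft memberships are expected to become nearly crisp as the iteration converges; the hard assignment is then read off with an argmax over clusters.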
The linear separation of the data formed by the K-means clustering method in the input space is shown in Figure 1(b). The separations obtained using the kernel based clustering method in the feature space of the Gaussian kernel for different values of σ² are shown in Figures 1(c) and 1(d). It can be seen from Figure 1(d) that the points of the two nonlinearly separable classes are perfectly separated for σ² = 0.18. This example demonstrates the potential of the kernel based clustering method to separate nonlinearly separable data in the input space.

[Fig. 1. Illustration of clustering in the input space and the kernel feature space for the interlocking clusters. (a) Scatter plot of the data of two interlocking clusters in a 2-dimensional space; the bottom cluster belongs to class 1 and the top cluster belongs to class 2. (b) Linear separation of the data obtained using K-means clustering in the input space. (c) Improper separation obtained using kernel based clustering in the feature space of the Gaussian kernel, for σ² = 1.0 and β = 18. (d) Nonlinear separation obtained using kernel based clustering in the feature space of the Gaussian kernel, for σ² = 0.18 and β = 18.]

III. VECTOR QUANTIZATION IN KERNEL FEATURE SPACE

Clustering in the input space leads to the formation of a codebook that consists of the mean vectors of the clusters formed. The mean vectors that represent the clusters are used to perform vector quantization in the input space. It involves computation of the distance between an input space vector x and each of the mean vectors in the codebook. For the clusters formed in the kernel feature space, there is no explicit representation of the mean vector of a cluster. For performing vector quantization in the kernel feature space, a vector x is assigned the index of the cluster whose center has the highest similarity to \Phi(x). The similarity measure between the vector x and the center of a cluster, C_k, in the feature space can be computed as follows:

    s_k(x) = \Phi^T(x) m_k^\Phi = \frac{1}{N_k} \sum_{\Phi(x_i) \in C_k} K(x, x_i)    (18)

Computation of s_k(x), k = 1, 2, ..., K, involves performing the kernel operation between x and each of the N data vectors used in the formation of the clusters. Therefore, vector quantization in the kernel feature space is computationally intensive. In order to reduce the number of kernel operations in vector quantization, we propose a method in which a cluster in the feature space is represented by a subset of its vectors close to its center [6]. The similarity measure in Eq. 18 is applied to each vector in a cluster to compute its similarity to the center of the cluster. The vectors with high similarity are considered to form a core of the cluster. A core of a cluster is the set of vectors closest to the center of the cluster.

Let S be the size of the core used to represent a cluster. If the mean vector of the S vectors in the core is approximately the same as the mean vector of all the data vectors in the cluster, then the center of the cluster can be represented by the center of the core for the purpose of vector quantization. The similarity measure s_k(x) is then expected to be close to the similarity of x to the mean vector of the core of the cluster C_k. For S ≪ N_k, this method leads to a significant reduction in the computational complexity of vector quantization in the kernel feature space.

In the next section, we present an approach that uses clustering and vector quantization for segmentation of continuous speech.

IV. DETECTION OF C+V UNIT BOUNDARIES

An utterance of a CV unit consists of a consonant region, a transition region, and a vowel region, in that sequence. The gross spectral characteristics of these regions in the signal are different. Short-time spectral analysis of the speech signal gives a sequence of speech parametric vectors that represent the spectral characteristics of the signal. The data set consisting of the spectral feature vectors of several utterances of a particular CV class is clustered into three groups. The vectors within a particular group or cluster are expected to be similar. Hence, the three clusters so formed are expected to represent the consonant region, transition region and vowel region of the utterances of a CV class. A codebook of size 3 is constructed for each of the M CV classes considered, using the data of CV segments excised from continuous speech. The M codebooks are used in detection of the boundaries of C+V segments in a given continuous speech utterance.

Clustering of the data of a CV class can be done using the K-means clustering method in the input space or the kernel based clustering method in the kernel feature space.
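The core-based vector quantization of Section III can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the Gaussian kernel choice for K(·,·), and the toy data and core size are our own assumptions; the similarity follows Eq. 18, with each cluster center approximated by its core.

```python
import numpy as np

def gaussian_kernel(x, Y, sigma2):
    """K(x, y) = exp(-||x - y||^2 / sigma2) for each row y of Y."""
    return np.exp(-np.sum((Y - x) ** 2, axis=1) / sigma2)

def build_cores(clusters, sigma2, S):
    """Keep the S vectors of each cluster most similar to its feature-space center.

    clusters : list of (N_k, d) arrays of training vectors, one per cluster
    """
    cores = []
    for Ck in clusters:
        # s_k(x_i) of Eq. 18, evaluated for each member x_i of the cluster
        sims = np.array([gaussian_kernel(xi, Ck, sigma2).mean() for xi in Ck])
        order = np.argsort(sims)[::-1]        # highest similarity first
        cores.append(Ck[order[:S]])
    return cores

def vq_index(x, cores, sigma2):
    """Assign x to the cluster whose core-approximated center is most similar."""
    sims = [gaussian_kernel(x, core, sigma2).mean() for core in cores]
    return int(np.argmax(sims))

# Toy usage: two clusters of 20 vectors each, represented by cores of size 2.
rng = np.random.default_rng(1)
c0 = rng.normal(0.0, 0.1, size=(20, 2))
c1 = rng.normal(3.0, 0.1, size=(20, 2))
cores = build_cores([c0, c1], sigma2=1.0, S=2)
print(vq_index(np.array([0.05, -0.02]), cores, sigma2=1.0))  # -> 0
```

With cores of size S, quantizing one vector costs K·S kernel evaluations instead of N, which is the saving the section describes for S ≪ N_k.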
For CV classes with stop consonants, affricates or fricatives as consonants, the characteristics of the signal in the different regions of the CV unit are significantly different. However, for CV classes with nasals or semivowels as consonants, the different regions in the CV unit have similar characteristics, leading to the formation of nonlinearly separable clusters. We demonstrate the effectiveness of the kernel based clustering method in obtaining a better separation of the data of the different regions of CV classes with nasals or semivowels as consonants.

After clustering the data set of a CV class, the clusters are labelled using the following method. For each utterance of the CV class in its training data set, the speech signal is processed to obtain a sequence of spectral parametric vectors. The sequence is divided into three subsequences of equal length. The three subsequences are assumed to correspond to the consonant region, transition region and vowel region, respectively. The codebook of the CV class is used in vector quantization of each of the vectors in the sequence of speech parametric vectors of an utterance to obtain a sequence of codebook indices. The distribution of the 3 codebook indices is obtained for the indices of the subsequences in the consonant regions of all the utterances of the class. The codebook index that has the highest frequency of occurrence is associated with the cluster for the consonant region of that class. In a similar way, the distributions of indices for the subsequences in the transition region and the vowel region are used to determine the codebook index associated with each region. The association of regions with the indices of clusters is used to label the three clusters as Consonant (C), Transition (T), or Vowel (V).

Now, we present the approach to segmentation of the continuous speech of a sentence into C+V units. For the continuous speech signal of the given utterance of a sentence, short-time spectral analysis is carried out to obtain a sequence of speech parametric vectors. Vector quantization is performed for each vector in the sequence using each of the M codebooks. Vector quantization of a vector x in the sequence using the j-th codebook gives the label i_j. The region for x is hypothesized as C, T, or V by determining the label with the highest frequency of occurrence in the set {i_1, i_2, ..., i_M}. Thus a sequence of region hypotheses is obtained for the sequence of speech parametric vectors of the utterance of a sentence. A typical sequence of region hypotheses is CCCTTVVVVVVCCCTTTVVVVCCTTTVVVV. In this sequence of region hypotheses, the points at which there is a change from V to C are hypothesized as the boundaries of C+V units. The block diagram of the proposed method for detection of C+V unit boundaries in continuous speech is given in Figure 2.

[Fig. 2. Block diagram of the system for detection of C+V unit boundaries. The continuous speech signal for a sentence is processed by short-time spectrum analysis to obtain a sequence of feature vectors. Each feature vector is vector quantized using Codebook 1 through Codebook M, each with entries for the consonant, transition and vowel regions, to obtain sequences of indices. A decision logic converts these into a sequence of region hypotheses, from which the boundaries of C+V segments are detected.]

V. STUDIES ON DETECTION OF C+V SEGMENT BOUNDARIES

In this section, we present our studies conducted on the continuous speech database of an Indian language, Telugu [7]. This database includes speech data of television broadcast news recordings. The 86 CV classes that have a frequency of occurrence of at least 100 in the database have been considered for clustering and building codebooks. For each of the 86 CV classes, the speech data of segments of that class excised from continuous speech is used in building the codebook for that class. The K-means clustering method is used for clustering in the input space. The kernel based clustering method is used for clustering of the data of each CV class in the kernel feature space of the Gaussian kernel. A value of 500 is used for the kernel width parameter, σ². This value of the kernel width parameter is suitable for the data of many CV classes. For vector quantization in the kernel feature space, the size of the core of a cluster, S, is chosen as 50.

The performance of the systems for segmentation of continuous speech is evaluated on the speech data of 200 sentences consisting of 2659 C+V segments. The speech data used for evaluation is different from the data used in building the codebooks. For analysis of the performance of the segmentation systems, the boundaries of the C+V segments have been marked manually. For each sentence in the test data set, the hypotheses of C+V unit boundaries are obtained using the methods presented in the previous section. For a chosen threshold on deviation, δ, the hypotheses with a maximum deviation of δ from the actual boundaries are considered as the matching hypotheses. The actual boundaries for which there is no hypothesis with a deviation less than δ are considered as the missing hypotheses. When there are two or more hypotheses with a deviation less than δ, the hypothesis with the least deviation is considered as the matching hypothesis, and the other hypotheses are considered as the spurious boundaries.
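The decision logic of Section IV, a majority vote across the M codebooks followed by detection of V-to-C changes, can be sketched as follows. This is a minimal illustration with our own function names; the example sequence is the one given in the text.

```python
from collections import Counter

def hypothesize_region(labels_from_codebooks):
    """Majority vote over the labels {i_1, ..., i_M} for one feature vector."""
    return Counter(labels_from_codebooks).most_common(1)[0][0]

def cv_boundaries(region_sequence):
    """Frame indices at which the region hypothesis changes from V to C.

    region_sequence : string over the alphabet {'C', 'T', 'V'}
    """
    return [i for i in range(1, len(region_sequence))
            if region_sequence[i - 1] == 'V' and region_sequence[i] == 'C']

# The typical sequence from the text has two V-to-C changes.
seq = "CCCTTVVVVVVCCCTTTVVVVCCTTTVVVV"
print(cv_boundaries(seq))  # -> [11, 21]
```

Note that only V-to-C changes are taken as C+V unit boundaries; changes such as C-to-T or T-to-V are internal to a unit and are ignored.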
The hypotheses with a deviation greater than δ from the nearest actual boundaries are also considered as spurious hypotheses. The performance of the system for segmentation of continuous speech using different methods for clustering and vector quantization is given in Table I. The performance of each method is given for different values of the threshold on deviation, δ.

TABLE I
PERFORMANCE OF THE SYSTEM FOR DETECTION OF BOUNDARIES OF C+V UNITS USING DIFFERENT METHODS OF CLUSTERING AND VECTOR QUANTIZATION (VQ), FOR DIFFERENT VALUES OF δ GIVEN IN MILLISECONDS.

                                  Matching hypotheses (in %)      Spurious hypotheses (in %)
  Method                          δ=10     δ=20     δ=30          δ=10     δ=20     δ=30
  Clustering and VQ in the
  input space                     47.62    58.41    62.24         40.99    27.68    22.87
  Clustering and VQ in the
  kernel feature space            55.51    67.08    73.34         44.90    33.21    27.02

It is seen that the segmentation system based on clustering and vector quantization in the kernel feature space gives a significantly better (by about 8% to 11%) performance in detection of C+V unit boundaries. For speech recognition tasks based on the segmentation and labeling approach, it is important that the boundaries are detected accurately. The spurious boundaries can be eliminated during the labeling phase in speech recognition.

An analysis of the performance of the segmentation systems in detection of the boundaries of C+V units for different categories of consonants is carried out. The number of detected boundaries (matching hypotheses) for 4 categories of consonants, namely, stop consonants, fricatives, nasals and semivowels, is given in Table II. It is seen that the kernel based clustering gives a significantly better performance in detection of the boundaries of C+V units with nasals and semivowels as consonants. It is mainly due to the better separation of the vowel regions from the regions of nasal and semivowel consonants.

TABLE II
NUMBER OF MATCHING HYPOTHESES IN DETECTION OF BOUNDARIES OF C+V UNITS OF DIFFERENT CATEGORIES OF CONSONANTS.

  Category            No. of actual     No. of matching hypotheses
                      boundaries        Input space     Kernel based
                                        clustering      clustering
  Stop consonants     1110              907             901
  Fricatives          316               293             262
  Nasals              336               82              245
  Semivowels          654               170             352

It is noted from Table II that even though the kernel based clustering gives a better performance than the input space clustering, a significant number of boundaries of C+V units with nasals and semivowels have not been detected. A second level hypothesization is done for the segments of continuous speech in which the boundaries are not detected. For the second level hypothesization, only the codebooks of the CV classes with nasals or semivowels as consonants are used in the system for segmentation. The performance of the system using the second level hypothesization is given in Table III for the kernel based clustering method. For comparison, the performance of the system without the second level hypothesization is also given in Table III. It is seen that the second level hypothesization is effective in increasing the percentage of matching hypotheses by about 10.5%.

TABLE III
PERFORMANCE OF THE SYSTEM FOR SEGMENTATION OF CONTINUOUS SPEECH, WITHOUT AND WITH THE SECOND LEVEL HYPOTHESIZATION. PERFORMANCE IS GIVEN FOR KERNEL BASED CLUSTERING AND A VALUE OF 30 MSEC FOR δ.

  System for                   Matching              Spurious
  segmentation                 hypotheses (in %)     hypotheses (in %)
  Without the second
  level hypothesization        73.34                 27.02
  With the second
  level hypothesization        83.88                 31.03

Boundary hypotheses obtained by segmentation of continuous speech using the clustering based methods for an utterance of a sentence in the Telugu language, /pra dhA ni jA ti ki aN ki tam chE sA ru/, are shown in Fig. 3. The manually marked boundaries, the boundaries hypothesized by clustering and vector quantization in the input space, the boundaries hypothesized by clustering and vector quantization in the kernel feature space, and the boundaries hypothesized using the second level hypothesization are shown in Fig. 3. The arrows in the plots indicate the boundaries not detected. It is seen that 3 of the 13 boundaries are not detected by the input space clustering based system. Only one boundary is not detected by the kernel based clustering method. This missing boundary is also detected using the second level hypothesization. A similar behavior is seen in the performance of the different methods for the utterance of another sentence, /a me ri kA adh yak Shu Du/, shown in Fig. 4. Results of the studies presented in this section demonstrate the effectiveness of the kernel based clustering methods in segmentation of continuous speech to detect the boundaries of C+V segments.
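The evaluation protocol described above can be sketched as follows. This is a minimal illustration with our own function name and toy boundary positions: each actual boundary is matched to the closest hypothesis within δ, and the remaining hypotheses are counted as spurious.

```python
def score_boundaries(actual, hypothesized, delta):
    """Count matching, missing and spurious hypotheses for one sentence.

    actual, hypothesized : boundary positions (e.g., in milliseconds)
    delta                : threshold on deviation
    """
    matched_hyps = set()
    matching = missing = 0
    for a in actual:
        # Unmatched hypotheses within delta of this actual boundary, closest first.
        candidates = sorted(
            (h for h in hypothesized
             if abs(h - a) <= delta and h not in matched_hyps),
            key=lambda h: abs(h - a))
        if candidates:
            matching += 1
            matched_hyps.add(candidates[0])   # only the closest one matches
        else:
            missing += 1
    # Hypotheses not matched to any actual boundary are spurious, whether they
    # lie within delta of a boundary or farther than delta from every boundary.
    spurious = len(hypothesized) - len(matched_hyps)
    return matching, missing, spurious

# Toy example with delta = 30 ms: 110 matches 100, 420 matches 400;
# 130 loses the tie-break and 900 is far from every boundary.
print(score_boundaries([100, 400, 700], [110, 130, 420, 900], delta=30))
# -> (2, 1, 2)
```

Percentages such as those in Tables I and III would follow by accumulating these counts over all sentences and normalizing by the numbers of actual boundaries and of hypotheses, respectively.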
VI. CONCLUSIONS

In this paper, we have proposed a method for detection of the boundaries of C+V segments in given continuous speech using a clustering and vector quantization based approach. It has been demonstrated that the method based on kernel based clustering and vector quantization gives a better performance than the method based on input space clustering and vector quantization. It has been shown that the use of the kernel based clustering method for separation of the data of the different regions in the CV classes has improved the performance of the method for speech segmentation.

[Fig. 3. (a) Speech signal waveform for the utterance of the sentence /pra dhA ni jA ti ki aN ki tam chE sA ru/, (b) manually marked boundaries of C+V segments, (c) boundaries hypothesized by clustering and vector quantization in the input space, (d) boundaries hypothesized by clustering and vector quantization in the kernel feature space, and (e) boundaries hypothesized using the second level hypothesization.]

[Fig. 4. (a) Speech signal waveform for the utterance of the sentence /a me ri kA adh yak Shu Du/, (b) manually marked boundaries of C+V segments, (c) boundaries hypothesized by clustering and vector quantization in the input space, (d) boundaries hypothesized by clustering and vector quantization in the kernel feature space, and (e) boundaries hypothesized using the second level hypothesization.]

REFERENCES

[1] Suryakanth V. Gangashetty, Neural Network Models for Recognition of Consonant-Vowel Units of Speech in Multiple Languages, Ph.D. thesis, Indian Institute of Technology Madras, February 2005.
[2] T. Nagarajan, Implicit Systems for Spoken Language Identification, Ph.D. thesis, Indian Institute of Technology Madras, February 2004.
[3] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Trans. Neural Networks, vol. 13, no. 3, pp. 780-784, May 2002.
[4] D. Srikrishna Satish, Kernel Based Clustering and Vector Quantization for Pattern Classification, M.S. thesis, Indian Institute of Technology Madras, March 2005.
[5] J. M. Buhmann, Data Clustering and Data Visualisation, Kluwer Academic, 1998.
[6] D. S. Satish and C. C. Sekhar, "Kernel based clustering and vector quantization for speech recognition," in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, Brazil, September 2004, pp. 315-324.
[7] A. Nayeemulla Khan, S. V. Gangashetty, and S. Rajendran, "Speech database for Indian languages - A preliminary study," in Proc. Int. Conf. Natural Language Processing, December 2002, pp. 295-301.