
Multi-feature Fusion for Closed Set Text Independent Speaker Identification

Gyanendra K. Verma
Indian Institute of Information Technology, Allahabad, Jhalwa, Allahabad, India
gyanendra@iiita.ac.in

Abstract. An intra-modal fusion, i.e., a fusion of different features of the same modality, is proposed for a speaker identification system. Two fusion methods, at the feature level and at the decision level, are proposed in this study. We use multiple features derived from MFCC and the wavelet transform of the speech signal. Wavelet-based features capture frequency variation across time, while MFCC features mainly approximate the base frequency information; both are important. A final score is calculated using a weighted sum rule over the matching results of the different features. We evaluate the proposed fusion strategies on the VoxForge speech dataset using a K-Nearest Neighbor classifier. The multiple-feature system gives promising results compared with each individual feature set. Further, the multi-feature system also performs well at different SNRs on NOIZEUS, a noisy speech corpus.

Keywords: Multi-feature fusion, intra-modal fusion, speaker identification, MFCC, wavelet transform, K-Nearest Neighbor (KNN).

1 Introduction
Fusion of multiple sources of information is a key challenge nowadays. Information fusion is defined as combining information from multiple sources in order to achieve higher performance than can be achieved with a single source [1]. There are basically two fusion categories: intra-modal fusion [2, 3], the fusion of different features of the same modality, and multimodal fusion [4, 5], the fusion of different modalities, e.g. combining face, speech, fingerprint, etc. This paper is based on intra-modal fusion. Further, information can be fused at the signal level, feature level, or decision level. At the signal level, data acquired from different sources are fused directly after preprocessing. At the feature level, multiple feature sets are fused. At the decision level, the outputs of multiple classifiers are fused based on a set of match scores. Complementary information is useful in the fusion process as it enhances the confidence in the decision [6]. Various methods have been developed for speaker identification, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC) and Gabor features; however, there are still open
problems that arise in real applications. Wang et al. [7] proposed a combined approach using MFCC and phase information for feature extraction from the speech signal. Most algorithms consider only a single feature set or directly combined features. In this study, a multi-feature fusion method is proposed for closed-set text-independent speaker identification in order to improve the performance of the system. Closed-set identification only considers the best match among the enrolled speakers. We use two feature extraction approaches, namely MFCC and the wavelet transform. The cepstral representation is a good way to represent the local spectral properties of a speech signal [8], whereas the wavelet transform captures frequency variation across time. The features obtained from the two approaches were fused at the feature and decision levels. At the feature level, a combined feature vector is generated by fusing the different features of the same speech. At the decision level, a final score is calculated using a weighted sum rule over the matching results of the different features. The VoxForge and NOIZEUS speech corpora have been used to evaluate the fusion schemes. This study contributes to the development of new data fusion methods in signal processing and information extraction. The multi-feature fusion approach can benefit many pattern recognition applications, provided the pros and cons of the system under consideration are taken into account. A general architecture of feature-level and decision-level information fusion is illustrated in Figs. 1a and 1b.
Fig. 1. A general architecture of Information Fusion (a) Feature level (b) Decision level


The rest of the paper is organized as follows: a review of feature extraction techniques is given in Section 2. The proposed fusion approach is described in Section 3. Experimental results and a discussion of the proposed work are presented in Section 4. Concluding remarks are given in Section 5.

2 Feature Extraction
Feature extraction is an important phase in any pattern recognition problem. In our study the features are obtained by applying two approaches, MFCC and the wavelet transform. The wavelet transform is able to perform local analysis, capturing the local information of a signal at multiple resolutions. The feature extraction processes using MFCC and the wavelet transform are described below.

2.1 Feature Extraction Using MFCC

The process of calculating the MFCC consists of the following steps.
Framing: the speech signal is segmented into overlapping frames of N samples.
Windowing: each frame is windowed prior to spectral analysis in order to minimize spectral distortion. Generally a Hamming window is used, given by
W(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1        (1)
Fast Fourier Transform (FFT): the FFT converts each frame from the time domain into the frequency domain.
Mel-frequency wrapping: the Mel value for a given frequency f is obtained by the formula

Mel(f) = 2595 log10(1 + f / 700)        (2)

Cepstrum: finally, the discrete cosine transform (DCT) is applied to the log Mel spectrum in order to obtain the MFCC coefficients.
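The paper's experiments were run in MATLAB; purely as an illustration of the steps above, the following Python sketch (using numpy and scipy; the frame length, hop size, filterbank size and coefficient count are assumed values, not the paper's settings) walks through framing, Hamming windowing, the FFT, Mel-frequency wrapping and the final DCT.

```python
# Minimal MFCC sketch (illustrative only; parameters below are assumptions).
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # Eq. (2)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=26, n_coeffs=13):
    # Framing: split the signal into overlapping frames of N samples
    frames = np.array([signal[i:i + frame_len]
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    # Windowing with a Hamming window, Eq. (1)
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    frames = frames * window
    # FFT: power spectrum of each frame
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Mel-frequency wrapping: triangular filters spaced evenly on the Mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, spectrum.shape[1]))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    mel_energy = np.log(spectrum @ fbank.T + 1e-10)
    # Cepstrum: DCT of the log Mel energies gives the MFCC coefficients
    return dct(mel_energy, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```

This returns per-frame coefficients; how the paper aggregates them into its fixed-length utterance-level vector is not detailed, so that step is omitted here.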

Fig. 2. Feature extraction process

2.2 Feature Extraction Using Wavelet Transform

The wavelet transform provides a compact representation that shows the energy distribution of the signal in time and frequency [9]. We use the discrete wavelet transform (DWT) to decompose the signal into multiple successive frequency bands using two sets of functions, the scaling functions and the wavelet functions, associated with the low-pass and high-pass filters respectively [10].
The filters are obtained as weighted sums of scaled and shifted versions of the scaling function itself. The information captured by the wavelet transform depends on the properties of the wavelet family (e.g. Daubechies, Symlet, Biorthogonal, Coiflet) and on the properties (waveform) of the target signal. The information extracted from a signal by wavelet transforms using different wavelet families need not be the same, so it is necessary to choose or evaluate the wavelet function that provides the most useful information for a particular application. Analyzing the signal at various scales and translations provides a multi-resolution time-frequency representation, as shown in Fig. 3. In the discrete wavelet decomposition of a signal, the outputs of the high-pass and low-pass filters can be represented mathematically by Equations 3 and 4.

Y_high[k] = Σ_n X[n] · g[2k − n]        (3)

Y_low[k] = Σ_n X[n] · h[2k − n]        (4)

where Y_high and Y_low are the outputs of the high-pass and low-pass filters, respectively.
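As a minimal illustration of Equations 3 and 4, the sketch below performs one level of decomposition by convolving the signal with the low-pass and high-pass filters and keeping every second output sample. The Haar filter pair is used purely for demonstration; the paper uses Daubechies wavelets.

```python
import numpy as np

def dwt_level(x, h, g):
    """One level of discrete wavelet decomposition (Eqs. 3 and 4):
    convolve with the low-pass (h) and high-pass (g) filters, then
    keep every second sample (the 2k indexing in the equations)."""
    full_low = np.convolve(x, h)
    full_high = np.convolve(x, g)
    return full_low[::2], full_high[::2]

# Haar filter pair, for illustration only
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (scaling) filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (wavelet) filter
x = np.random.randn(1024)                  # a dummy speech segment
approx, detail = dwt_level(x, h, g)        # cA1 and cD1
```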


Fig. 3. Schematic of Discrete Wavelet decomposition of a speech signal

In order to extract the wavelet coefficients, the speech signal is passed through successive high-pass and low-pass filters. The selection of a suitable wavelet and of the number of decomposition levels is important. For one-dimensional speech signals the Daubechies wavelet family gives good results for non-stationary signal analysis [11], so we have used it in our study. The feature vector obtained from the six-level wavelet coefficients provides a compact representation of the signal. The coefficients cover the whole bandwidth from low to high frequencies, and the original signal can be represented by the sum of the coefficients in every sub-band, i.e. cD6, cD5, cD4, cD3, cD2 and cD1. Feature vectors are obtained from the detail coefficients by applying common statistics and entropy measures. The discriminatory property of entropy features makes them suitable for extracting frequency distribution information [12].
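The exact statistics used in the paper are not fully specified; as one plausible reading, the following sketch (Python with the PyWavelets package; the db4 wavelet and the particular statistics are assumptions) performs a six-level decomposition and computes the mean, standard deviation, energy and a Shannon-entropy estimate for each detail band.

```python
import numpy as np
import pywt

def wavelet_features(signal, wavelet='db4', levels=6):
    """Six-level DWT of a speech signal; statistics and entropy of each
    detail band (cD6 ... cD1) form the wavelet feature vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=levels)   # [cA6, cD6, cD5, ..., cD1]
    features = []
    for detail in coeffs[1:]:                               # detail coefficients only
        energy = np.sum(detail ** 2)
        p = detail ** 2 / (energy + 1e-12)                  # normalized energy distribution
        entropy = -np.sum(p * np.log2(p + 1e-12))           # Shannon entropy of the band
        features.extend([np.mean(detail), np.std(detail), energy, entropy])
    return np.array(features)
```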

3 Proposed Fusion Approach


We propose two fusion approaches for multiple features. The first approach fuses information at the feature level (Fig. 4) and the second fuses information at the decision level (Fig. 5). Low-level features of the speech signal are extracted independently using the MFCC and wavelet transform analyses described in Sections 2.1 and 2.2, respectively. The fusion strategies are discussed below.

3.1 Feature-Level Fusion

In feature-level fusion, the features obtained from the two approaches are concatenated so that the MFCC features occupy the first half of the feature vector and the wavelet features occupy the second half. Let the features obtained from the MFCC coefficients be F_mfcc = (f_m1, f_m2, ..., f_mn) and those from the wavelet coefficients be F_wav = (f_w1, f_w2, ..., f_wn); then the fused feature vector is given by F_fusion = [F_mfcc, F_wav], i.e.

F_fusion = {f_m1, f_m2, ..., f_mn, f_w1, f_w2, ..., f_wn}
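A minimal sketch of this feature-level fusion, assuming a per-vector min-max normalization before concatenation (the function names and the normalization choice are illustrative):

```python
import numpy as np

def min_max_normalize(v):
    """Map a feature vector into the [0, 1] range (min-max normalization)."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + 1e-12)

def fuse_features(f_mfcc, f_wav, normalize=True):
    """Feature-level fusion: F_fusion = [F_mfcc, F_wav]."""
    if normalize:
        f_mfcc, f_wav = min_max_normalize(f_mfcc), min_max_normalize(f_wav)
    return np.concatenate([f_mfcc, f_wav])
```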


3.2 Decision-Level Fusion

At the decision level, the procedure starts with normalization of the scores obtained from the different feature extraction approaches. Normalization is performed to map the values produced by the different classifiers into a common range.

Fig. 4. Feature level fusion architecture

Min-max normalization is used here. The threshold value differs between classifiers, so the matching scores are further rescaled in order to obtain the same threshold value for each classifier. A speaker is accepted only if the score falls within the threshold range; otherwise the speaker is rejected. Finally, the scores are combined using the sum rule, which takes the weighted average of the individual score values. Let x_1, x_2, ..., x_n be the weighted scores corresponding to classifiers 1, 2, ..., n. Then the fusion is given by Equation 5.

S_comb = (1/n) Σ_{i=1}^{n} x_i        (5)
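A sketch of the decision-level combination under the assumptions above: the matching scores of each classifier are min-max normalized, weighted (the weights shown are placeholders) and averaged according to Equation 5.

```python
import numpy as np

def normalize_scores(scores):
    """Min-max normalization of a classifier's matching scores into [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def combine_scores(score_lists, weights=None):
    """Weighted sum rule (Eq. 5): average of the weighted, normalized scores."""
    score_lists = [normalize_scores(s) for s in score_lists]
    n = len(score_lists)
    if weights is None:
        weights = [1.0] * n            # placeholder weights
    weighted = [w * s for w, s in zip(weights, score_lists)]
    return sum(weighted) / n           # S_comb

# Example: per-speaker scores from the MFCC-based and wavelet-based classifiers
s_mfcc = [0.2, 0.7, 0.9]
s_wav = [0.1, 0.8, 0.6]
s_comb = combine_scores([s_mfcc, s_wav])
```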

4 Experimental Results and Discussion


VoxForge corpus [13]: the corpus contains more than 200 speaker profiles of males and females, and each profile contains 10 speech samples. The sampling frequency is 8 kHz with a bit depth of 16 bits. The duration of the speech samples ranges between 2 and 10 seconds. All speech files are in wav format.


Fig. 5. Decision-Level Fusion Architecture

NOIZEUS [14]: the noisy database contains 30 IEEE sentences (produced by three male and three female speakers) with noise recorded in different environments: a crowd of people, car, exhibition hall, restaurant, street, airport, train station and train. The noise was added to the speech signals at SNRs of 15 dB, 10 dB and 5 dB. All files are in wav format (16-bit PCM, mono).

The experiments comprised two modules, training and testing, performed on the standard VoxForge speech corpus and on NOIZEUS. Five speech samples per speaker were used for training and another five for testing. In total, 33-dimensional and 30-dimensional feature vectors were obtained from the MFCC and wavelet decompositions, respectively, as described in Sections 2.1 and 2.2. The min-max algorithm was used for feature set normalization before classification in order to improve the identification accuracy on the large dataset. All experiments were performed in MATLAB 7.6 (R2008b). For classification, speech samples of the same speaker are assigned the same class: the five samples of speaker A (A1, A2, A3, A4, A5) are assigned class 1, the five samples of speaker B (B1, B2, B3, B4, B5) are assigned class 2, and so on, so that the whole training set is grouped into classes. The Euclidean distance is used to compute the distances between vectors in the KNN algorithm. The performance of the discrete wavelet and MFCC features on the standard VoxForge speech corpus is shown in Table 1. The proposed speaker identification system uses the wavelet and MFCC feature vectors described above with 10 samples from each of 200 speakers. The parameters of the proposed design are the result of evaluating different speaker identification designs using the VoxForge and alternative corpora.
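For the classification step, a possible sketch of the KNN matching with Euclidean distance is shown below. scikit-learn is used for brevity, and the number of neighbors, the dummy data and the 63-dimensional fused vectors are assumptions; the paper's experiments were run in MATLAB.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X_train: fused feature vectors of the enrolled samples (5 per speaker)
# y_train: class labels (speaker A -> 1, speaker B -> 2, ...)
X_train = np.random.randn(1000, 63)        # dummy data: 200 speakers x 5 samples
y_train = np.repeat(np.arange(1, 201), 5)

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)

X_test = np.random.randn(1, 63)            # a query utterance's fused feature vector
predicted_speaker = knn.predict(X_test)[0]
```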


Table 1. Classification results with multi-feature (classification rate, %)

No. of Speakers   Wavelet   MFCC    Fusion
10                100       96      100
20                99.0      91      100
30                94.0      84      95
40                89.5      81      92
60                87.0      74.6    90.6
80                88.0      73      92.5
100               89.4      73.6    93.2
120               88.1      72.8    91.8
140               85.8      68.4    90.6
160               85.5      68      90.4
180               83.7      67.33   90.4
200               83.9      66.8    90.2

Fig. 6. Performance graph (classification accuracy in % versus number of speakers for the Wavelet, MFCC and Fused features)


The classification accuracy of the fused features is 90.2% with 200 speakers. A threshold θ is used to assign the class of a speaker, here θ = 0.60: if the matched samples for a query all belong to the same class and the score x > θ, the query sample is assigned to that class. The classification results are shown in Table 1 and the corresponding performance graph is illustrated in Fig. 6. The performance of the system on the noisy dataset at different SNRs is illustrated in Fig. 7.
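One plausible reading of this threshold rule, turning the agreement among the matched samples into a score compared against θ = 0.60, is sketched below; the scoring scheme is an assumption, not the paper's exact rule.

```python
def assign_class(neighbor_labels, threshold=0.60):
    """Assign the query to a class only if the matched samples agree and the
    agreement score exceeds the threshold; otherwise reject (return None)."""
    best = max(set(neighbor_labels), key=neighbor_labels.count)
    score = neighbor_labels.count(best) / len(neighbor_labels)
    return best if score > threshold else None

print(assign_class([3, 3, 3, 7, 3]))  # score 0.8 > 0.60 -> speaker class 3
```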

Fig. 7. Performance with noisy speech at different SNRs (classification accuracy in % for the Wavelet, MFCC and Fused features at noise levels of 15 dB, 10 dB and 5 dB)

5 Conclusions
New fusion methods at the feature level and at the decision level for multiple features were discussed and evaluated in this study. Features extracted from the speech signal using MFCC and the wavelet transform were either combined to form a hybrid feature space before classification or classified separately and then combined using a rule-based approach. Examination of these feature fusion strategies shows that they improve speaker identification. MFCC and the discrete wavelet transform were used to extract multiple features from the speech signal, and the features were fused at both the feature and the decision level. A KNN classifier was used to measure the similarity between the extracted features and a set of reference features. All experiments were performed on standard speech corpora, namely VoxForge and NOIZEUS. The results obtained with the fusion schemes show a significant improvement in the performance of the system.


References
1. Multimodel Data Fusion, http://www.multitel.be/?page=data
2. Marcel, S., Bengio, S.: Improving face verification using skin color information. In: 16th International Conference on Pattern Recognition, pp. 378–381 (2002)
3. Czyz, J., Kittler, J., Vandendorpe, L.: Multiple classifier combination for face-based identity verification. Pattern Recognition 37(7), 1459–1469 (2004)
4. Wang, Y., Tan, T., Jain, A.K.: Combining face and iris biometrics for identity verification. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688. Springer, Heidelberg (2003)
5. Hong, L., Jain, A.K., Pankanti, S.: Can multi-biometrics improve performance? Technical Report MSU-CSE-99-39, Department of Computer Science, Michigan State University, East Lansing, Michigan (1999)
6. An Introduction to Data Fusion, Royal Military Academy, http://www.sic.rma.ac.be/Research/Fusion/Intro/content.html
7. Wang, L., Minami, K., Yamamoto, K., Nakagawa, S.: Speaker identification by combining MFCC and phase information in noisy environments. In: 35th International Conference on Acoustics, Speech, and Signal Processing, Dallas, Texas, U.S.A. (2010)
8. Patel, I., Srinivas Rao, Y.: A Frequency Spectral Feature Modeling for Hidden Markov Model Based Automated Speech Recognition. In: Meghanathan, N., Boumerdassi, S., Chaki, N., Nagamalai, D. (eds.) NeCoM 2010. CCIS, vol. 90, pp. 134–143. Springer, Heidelberg (2010)
9. Dutta, T.: Dynamic time warping based approach to text dependent speaker identification using spectrograms. Congress on Image and Signal Processing 2, 354–360 (2008)
10. Tzanetakis, G., Essl, G., Cook, P.: Audio analysis using the discrete wavelet transform. In: Proceedings of the Conference in Acoustics and Music Theory Applications, Skiathos, Greece (2001)
11. Toh, A.M., Togneri, R., Northolt, S.: Spectral entropy as speech features for speech recognition. In: Proceedings of PEECS, Perth, pp. 22–25 (2005)
12. VoxForge Speech Corpus, http://www.voxforge.org
13. NOIZEUS: A Noisy Speech Corpus for Evaluation of Speech Enhancement Algorithms, http://www.utdallas.edu/~loizou/speech/noizeus/
