
Voice recognition

Contents
ABSTRACT
ACKNOWLEDGEMENT
CHAPTER 1
1.0 INTRODUCTION
1.1 BACKGROUND
1.2 PROBLEM STATEMENT
1.3 JUSTIFICATION OF THE PROJECT
1.4 OBJECTIVES
1.5 METHODOLOGY
CHAPTER 2
2.0 VOICE
2.0 INTRODUCTION
2.1 WHAT IS VOICE
2.2 VOICE PRODUCTION
2.3 VOICE MECHANISM
2.4 VOICE CHARACTERISTICS
2.5 VOICE QUALITIES
2.6 VOICE BIOMETRICS
2.7 BIOMETRIC SYSTEM
CHAPTER 3
3.0 INTRODUCTION
3.1 OPEN SET VS CLOSED SET
3.2 IDENTIFICATION VS VERIFICATION
3.3 MODULES
CHAPTER 4
4.0 INTRODUCTION
4.1 PRE-PROCESSING
4.2.1 Frame blocking
4.2.2 Windowing
4.2.3 Truncation
4.2.4 SHORT TERM FAST FOURIER TRANSFORM
4.2.5 Mel Frequency Warping
4.2.6 CEPSTRUM
4.2.7 LINEAR PREDICTIVE ANALYSIS
4.2.8 Linear Predictive Coding Coefficients (LPCC)
4.2.9 PERCEPTUAL LINEAR PREDICTIVE (PLP)
4.3.0 Mel Frequency Cepstrum Coefficients (MFCC)
4.3.1 DISCRETE WAVELET TRANSFORM (DWT)
Advantages of DWT
Disadvantages of DWT
CHAPTER 5
5.0 INTRODUCTION
5.2 FEATURE MATCHING
5.2.1 VECTOR QUANTIZATION
5.2.1.1 OPTIMIZATION USING LBG ALGORITHM
5.3 FAST FOURIER TRANSFORM
5.4 Dynamic Time Warping
5.4.1.2 Speaker identification
5.5 HIDDEN MARKOV MODELS (HMM)
5.6 INTRODUCTION
5.7 Speech analysis
5.8 MODEL DESCRIPTION
MAXIMUM LIKELIHOOD PARAMETER ESTIMATION
5.9 SPEAKER IDENTIFICATION
CHAPTER 6
6.0 RESULTS
6.1 WHEN ALL VALID SPEAKERS ARE CONSIDERED
6.2 WHEN THERE IS AN IMPOSTER IN PLACE OF SPEAKER 4
6.3 EUCLIDEAN DISTANCES BETWEEN THE CODEBOOKS OF SPEAKERS
6.4 CHALLENGES FACED
6.5 CONCLUSION
6.7 RECOMMENDATIONS
REFERENCES

SUPERVISOR: DR P. MANYERE

ABSTRACT
Voice recognition is the process of validating a user's claimed identity using features extracted from his/her voice. It has many applications in forensic speaker recognition, authentication and many other fields. Speaker recognition is divided into identification and verification. Identification is the process of finding out which registered speaker has given an utterance, whilst speaker verification is the process of accepting or rejecting the identity claim of a speaker. The process of speaker recognition consists of feature extraction and feature matching. Feature extraction is the process where we extract a small amount of data from the voice signal that can be used to represent each speaker. Feature matching involves identification of the unknown speaker. My project consists of pre-processing the signals (s1.wav to s8.wav), passing them through a window function, calculating the short-time Fourier transform (FFT), extracting features and matching them with the stored templates. The methods considered for feature extraction are cepstral coefficient calculation and mel frequency cepstral coefficients (MFCC), and the ones considered for feature matching are Gaussian Mixture Models (GMM), Dynamic Time Warping (DTW) and vector quantization via the Linde-Buzo-Gray algorithm (VQ-LBG).


ACKNOWLEDGEMENT
The success of this project hinges largely on the inspiration and advice of many others. I would like to express my deep gratitude to Dr. P. Manyere, who has been a guiding force through his supervision, valued feedback and encouragement for the duration of the project. Next, I want to express my appreciation to the people who contributed directly or indirectly to the successful completion of this project. Finally, I would like to thank God for the strength he gave me throughout the year and the love he put in my wonderful family members, who have been supportive socially, spiritually and financially. I am grateful for this tremendous help.


CHAPTER 1

1.0 INTRODUCTION
The purpose of this project is to design and simulate a system in MATLAB that can identify an individual who is speaking by analysing the spectral content of their voice. This involves digital voice analysis and recognition. The discussion starts with the theoretical background and then explains the problem statement, justification, aims and objectives, methodology and finally the results.

1.1 BACKGROUND
Long ago people lived in caves, but with the advent of industrialisation many began to build their own homes. Many sectors then grew rapidly, with marked improvements in home automation, mainly in areas such as security, culture, leisure, comfort, energy savings, management and economic activities. As a result, speaker or voice recognition has improved over the years. Speech has been one of the oldest and most natural means of information exchange between human beings, and for years people have tried to develop machines that can understand and produce speech naturally.
Voice recognition is a biometric modality that uses an individual's voice for recognition purposes. It depends on features influenced by the physical structure of an individual's vocal tract and the behavioural traits of the individual. The idea is to validate a user's identity using characteristics extracted from their voice. This technique can be used for authentication, surveillance, banking by telephone, telephone shopping, database access services, information services, voice mail, security control and many other areas.
Voice recognition can be classified into identification and verification. Speaker identification is the process of finding which registered speaker provided a given utterance, whilst speaker verification is making a decision to accept or reject an identity claim. The process of speaker recognition consists of two modules, namely feature extraction and feature matching. Feature extraction is the process in which we extract data from the voice signal that will later represent each speaker. Feature matching involves identification of the unknown speaker by comparing the features extracted from his/her voice input with the ones from a set of known speakers.
The proposed work consists of determining what constitutes a voice, developing code that records and recognises an individual, developing code for the analysis and identification of this individual, and finally simulating the program in MATLAB.


1.2 PROBLEM STATEMENT
Due to the long queues experienced in banks, rising security concerns, and the need to improve the quality of life of disabled and elderly people in Zimbabwe, there is a need to develop a digital system that will identify individuals by voice and respond accordingly.

1.3 JUSTIFICATION OF THE PROJECT

1.4 OBJECTIVES
The objectives of this research are to:
Determine what constitutes a voice.
Critically review literature related to voice recognition.
Develop code that records and recognises an individual.
Develop code that analyses the voice and identifies an individual's voice.
Simulate the program in MATLAB.

1.5 METHODOLOGY
Consulting the supervisor
Research on the internet
Industrial visits
Use of the library


CHAPTER 2
2.0 VOICE
2.0 INTRODUCTION
Voice recognition is finding out who is speaking based on the information included in the speech waves. Upon identification the system then performs the respective task. There is therefore a need to understand what makes voice suitable for such a task: why is it possible to identify an individual just from the voice? This chapter looks at the aspects and features of voice, which include voice biometrics, voice production and voice qualities.

2.1 WHAT IS VOICE

According to the National Institute on Deafness and Other Communication Disorders (NIDCD), voice (or vocalization) is the sound produced by humans and other vertebrates using the lungs and the vocal folds in the larynx, or voice box. Voice is generated by airflow from the lungs as the vocal folds are brought close together. When air is pushed past the vocal folds with sufficient pressure, the vocal folds vibrate, and this process is as unique as a fingerprint.

2.2 VOICE PRODUCTION

Figure 1: Diagram showing how voice is produced in the voice box.


In voice production, the spoken word results from three components: voiced sound, resonance and articulation.

Voiced sound: the basic sound produced by vocal fold vibration.
Articulation: the process of modifying the voiced sound. The vocal tract articulators (the tongue, soft palate and lips) produce recognisable words.
Resonance: the process of amplifying and modifying the voiced sound by the vocal tract resonators (the throat, mouth cavity and nasal passages). The resonators produce a person's recognizable voice.

2.3 VOICE MECHANISM
Speaking involves a voice mechanism that has three subsystems. Each subsystem is composed of different parts of the body and has a specific role in voice production.

The three voice subsystems:

Air pressure system - voice organs: diaphragm, chest muscles, ribs, abdominal muscles, lungs. Role in sound production: provides and regulates the air pressure that causes the vocal folds to vibrate.
Vibratory system - voice organs: voice box (larynx) and vocal folds. Role in sound production: the vocal folds vibrate, changing air pressure into sound waves and producing the voiced sound, frequently described as a buzzy sound; it also varies the pitch of the sound.
Resonating system - voice organs: vocal tract: throat (pharynx), oral cavity and nasal passages. Role in sound production: changes the buzzy sound into a person's recognizable voice.

Figure 2: Key functions of the voice box

The key function of the voice box is to open and close the glottis (the space between the two vocal folds). Its role in voice is to close the glottis and adjust vocal fold tension.
The key components of the voice box are:

Muscles
Nerves
Vocal folds.

2.4 VOICE CHARACTERISTICS
Human speech is greatly influenced by the affective state of the speaker, such as sadness, happiness, fear, anger, aggression, lack of energy or drowsiness.

Speech characteristics can be roughly described by a few major features, namely:

Speech flow: the rate or pace at which utterances are produced, as well as the number and duration of temporary breaks in speaking.
Loudness: reflects the amount of energy associated with the articulation of utterances and, when regarded as a time-varying quantity, the speaker's dynamic expressiveness.
Intonation: the manner of producing utterances with respect to the rise and fall of pitch; it leads to tonal shifts in either direction of the speaker's mean vocal pitch.
Intensity of overtones: overtones are the higher tones which faintly accompany a fundamental tone, and they are responsible for the tonal diversity of sounds.

2.5 VOICE QUALITIES
Voice qualities are as distinctive as our faces; no two are exactly the same. The traits that make our voices so unique can be grouped into two categories: fundamental frequency (high or low) and intensity (loud or soft). Other attributes fall under vocal quality.
If we were to create an equation for an individual's unique voice it would be:
Voice quality = vocal tract configuration + laryngeal anatomy + learned component
The shape of an individual's vocal tract is partly genetic and partly learned. Necks are of different sizes, hence the pharynx may be narrow or wide. These attributes are genetically determined, except for configurations due to trauma or disease.
Similarly, laryngeal anatomy is partially determined at birth: the length of one's vocal folds is determined by genes. The general hydration of one's vocal fold tissues and the agility of the muscles can be controlled by vocal health and training.
The learned part of the equation can be referred to as vocal habits. These include items such as the rhythm and rate of speech and vowel pronunciation.

2.6 VOICE BIOMETRICS


A biometric authentication system works on the principles of verification and identification. In verification, the system compares the presented biometric record with one already stored in the database for the claimed identity, in order to verify that identity. In identification, the input is just a biometric record: the system must look for the stored biometric data most similar to the input query and decide whether both belong to the same person. Biometrics provides methods for recognising humans based on physical or behavioural traits. Biometric traits can be divided into two classes:

Behavioural
1. Voice
2. Gait
3. Rhythm

Physiological
1. D.N.A
2. Palm print
3. Finger print
4. Face recognition
5. Hand geometry
6. Iris recognition

2.7 BIOMETRIC SYSTEM

Figure 3: a diagram showing the biometric system

Figure 4: The two fundamental tasks of a speaker recognition system: identification (comparing speech with a set of known voices) and verification (determining whether speech matches a particular known voice).

Chapter 3
Principles of speaker recognition.

Figure 5 :Diagram showing the speaker recognition system


3.0 INTRODUCTION
Speaker recognition is made up of two integral processes: verification and identification. These two form the core of the recognition process. This chapter starts by defining the process of speaker recognition and then develops the verification and identification stages. Speaker identification is further divided into text-dependent and text-independent speaker identification; this project focuses on text-dependent speaker identification. Other terms will be fully explained in this chapter and a summary will be given at the end.

Speaker recognition is a biometric system that performs the computing task of validating a user's claimed identity using characteristic features extracted from their speech samples. It can be classified as follows.


Figure 6: Diagram showing the three classifications of a speaker recognition system: text dependent vs text independent, open set vs closed set, and identification vs verification.


3.2.1 Text Dependent vs Text Independent
This is one way of classifying speaker recognition systems. It is based on the text uttered by the speaker during the identification process.
1. Text dependent: the test utterance is the same as that used in the training phase. The test speaker has knowledge of the system.
2. Text independent: the speaker does not have any knowledge of the contents of the training phase and can say anything.
In my project I have used text-dependent recognition.

3.1 OPEN SET VS CLOSED SET

Speaker recognition can be referred to as open-set or closed-set speaker recognition. This categorisation is based on the set of trained speakers in a system.
1. Open set: an open-set system may be presented with speakers who are not among the trained speakers, so an input utterance may belong to none of the registered users.
2. Closed set: a closed-set system has a fixed number of users registered to the system, and every test utterance is assumed to come from one of them.
In my project I have used a closed set of trained speakers.

3.2 IDENTIFICATION VS VERIFICATION

1. Identification: a means of establishing a person's identity; in this case we determine which speaker provided a given utterance.

Figure 8 :Diagram showing the identification stage


2. Verification: the process of verifying the claimed identity of the speaker based on his/her voiceprint.

Figure 9:Diagram showing the verification stage


3.3 MODULES
The two main modules of speaker recognition are feature extraction and feature matching.

Feature extraction:
This is the first and most important part of speech recognition, as it distinguishes one speech sample from another. Feature extraction converts the speech waveform into a set of feature vectors used for analysis.

Feature matching: this process follows feature extraction; it matches the stored template against the features extracted from the input speech signal.


CHAPTER 4
FEATURE EXTRACTION

4.0 INTRODUCTION
The aim of this chapter is to convert the speech waveform into a set of feature vectors for analysis. This process is often called the signal-processing front end.
The speech signal is usually treated as a quasi-stationary signal for ease of analysis. A quasi-stationary signal is a slowly time-varying signal, so over a short interval the signal behaves as though it were stationary.
Short-time spectral analysis is the most appropriate method to characterise a speech signal. This follows from the observation that when examined over a sufficiently short period of time (between 5 and 100 ms) the characteristics of the speech signal are fairly stationary, whereas over longer intervals (1/5 s or more) the characteristics change.
There are a variety of possible methods for parametrically representing the speech signal for speaker recognition purposes. These methods include: Linear Predictive Coding (LPC), Linear Predictive Cepstral Coefficients (LPCC), Mel Frequency Cepstrum Coefficients (MFCC), cepstral coefficients using the DCT, AMFCC, Perceptual Linear Prediction (PLP), power spectral analysis, Relative Spectral filtering of log domain coefficients (RASTA), First Order Derivative (DELTA) and the Discrete Wavelet Transform (DWT). These methods will be explained and the best method chosen based on their advantages and disadvantages.

Figure 10: Feature extraction


The main objective of feature extraction is to simplify recognition by summarizing the vast amount of speech data without losing the properties that define the speech, that is, while preserving its acoustic properties.


Figure 11 :Block diagram of feature extraction.

4.1 PRE-PROCESSING
Before feature extraction the signal has to be pre-processed, that is, passed through various stages of signal conditioning so that the later stages work on clean, well-formed frames. These stages are:

Frame blocking
Windowing
Truncation
Short term Fourier Transform
Mel frequency warping
Cepstrum
Mel Frequency Cepstral Coefficients(MFCC)
Perceptual linear prediction(PLP).
Power Spectral analysis (FFT)
First Order Derivative(DELTA),
Discrete wave transform(DWT)
Relative spectra filtering of log domain coefficients(RASTA).

4.2.1 Frame blocking

This is when the continuous speech signal is blocked into frames of N samples. The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples, and so on. The goal of this overlapping is to smooth the transition from frame to frame. The process continues until all the speech is accounted for within one or more frames. Typical values are, for example, N = 256 samples (roughly 30 ms of speech) and a frame shift of M = 100 samples.


Figure 12: Diagram showing framing

4.2.2 Windowing
Windowing follows frame blocking. It is done to reduce discontinuities at the edges of each frame: a window tapers the signal towards zero at the start and end of each frame, which reduces spectral distortion. If the windowing function is defined as w(n), 0 <= n <= N-1, where N is the total number of samples in each frame, then the resulting signal is

y(n) = x(n) w(n),   0 <= n <= N-1.

In general, Hamming windows are used, and they have the form

w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)),   0 <= n <= N-1.
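As an illustration, a minimal MATLAB sketch of frame blocking followed by Hamming windowing might look like the following (the values N = 256 and M = 100 and the variable name x are assumptions for illustration, not the project's exact code):

    N = 256; M = 100;                             % assumed frame length and frame shift
    numFrames = floor((length(x) - N)/M) + 1;
    w = hamming(N);                               % w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = zeros(N, numFrames);
    for k = 1:numFrames
        idx = (k-1)*M + (1:N);                    % frame k overlaps frame k-1 by N - M samples
        frames(:, k) = x(idx) .* w;               % y(n) = x(n) w(n)
    end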


Figure 13: Diagram showing the Hamming window

4.2.3 Truncation
The set sampling frequency of the wavread command is 44100 Hz. Recording an audio clip for two seconds would therefore give close to ninety thousand samples, which would be a lot to handle. There is therefore a need to truncate the signal: a particular threshold value is selected and the time axis is traversed in the positive direction until the first sample above the threshold is found, which marks the start. To find the end, the same algorithm is repeated in the negative direction.
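A minimal MATLAB sketch of this threshold-based truncation, assuming the recorded signal is held in a variable x and an arbitrary threshold of 5% of the peak amplitude, could be:

    thr   = 0.05 * max(abs(x));                 % assumed threshold value
    first = find(abs(x) > thr, 1, 'first');     % traverse forward to find the start
    last  = find(abs(x) > thr, 1, 'last');      % traverse backward to find the end
    x     = x(first:last);                      % truncated signal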


4.2.4 SHORT TERM FAST FOURIER TRANSFORM

The fast Fourier transform (FFT) converts each frame of N samples from the time domain into the frequency domain. It is an efficient algorithm for computing the discrete Fourier transform, which, for the N samples {x_n}, is defined as

X_k = sum_{n=0}^{N-1} x_n e^(-j*2*pi*k*n/N),   k = 0, 1, 2, ..., N-1,

where the X_k are complex numbers and we take only their absolute values. The resulting sequence {X_k} is interpreted as follows: the values 0 <= k <= N/2 - 1 correspond to the positive frequencies 0 <= f < Fs/2, while the values N/2 + 1 <= k <= N - 1 correspond to the negative frequencies -Fs/2 < f < 0, where Fs stands for the sampling frequency. What results after this step is called a periodogram.

Figure 14: Diagram showing the short term fast Fourier transform


A spectrum or a periodogram is what we obtain after the FFT.
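As a rough MATLAB sketch (assuming the windowed frames are the columns of the matrix frames produced earlier, and an FFT length of 256), the periodogram of each frame can be obtained as follows:

    NFFT = 256;                                  % assumed FFT length
    X = fft(frames, NFFT);                       % one DFT per column (per frame)
    P = abs(X(1:NFFT/2 + 1, :)).^2;              % periodogram: keep the non-negative frequencies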

4.2.4.1 Disadvantages of FFT

The Fourier transform on its own is not suitable for the analysis of non-stationary signals, because it provides only the frequency information of a signal and does not provide information about the time at which each frequency is present.

4.2.5 Mel Frequency Warping

At this stage the continuous speech has been subjected to frame blocking, windowing and the FFT, resulting in a periodogram.
Psychophysical studies have concluded that human perception of the frequency content of speech signals does not follow a linear scale: the scaling is approximately linear up to 1 kHz and logarithmic above 1 kHz. Thus for each tone with an actual frequency f, measured in Hertz, a subjective pitch is measured on a scale called the mel scale (melody scale). The mel frequency scale has linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz. As a reference, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.
Therefore the following approximate formula (the commonly used form) can be used to compute the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f/700).


Figure 15: Mel scale filter bank

The mel scale can be implemented with a bank of band-pass filters spaced uniformly on the mel scale, as shown in the figure above. In this filter bank the spacing and bandwidth of each filter are determined by a constant mel frequency interval. Each filter has a triangular frequency response, and the number of mel spectrum coefficients, K, is typically chosen to be 20. Note that this filter bank is applied in the frequency domain, so it simply amounts to applying triangle-shaped windows to the spectrum. A mel-warping filter bank can therefore be viewed as a set of histogram bins in the frequency domain, where neighbouring bins overlap.
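A minimal MATLAB sketch of such a triangular mel filter bank is given below; the sampling rate (8000 Hz), FFT length (256), number of filters (20) and use of the periodogram matrix P from the FFT step are assumptions for illustration, not necessarily the values used in the project:

    fs = 8000; NFFT = 256; K = 20;               % assumed parameters
    melMax = 2595*log10(1 + (fs/2)/700);         % top of the mel axis
    melPts = linspace(0, melMax, K + 2);         % K filters need K + 2 edge points
    hzPts  = 700*(10.^(melPts/2595) - 1);        % edges converted back to Hz
    bin    = floor((NFFT + 1)*hzPts/fs) + 1;     % FFT bin index of each edge
    H = zeros(K, NFFT/2 + 1);                    % one triangular filter per row
    for m = 1:K
        for k = bin(m):bin(m+1)                  % rising slope of filter m
            H(m, k) = (k - bin(m)) / (bin(m+1) - bin(m));
        end
        for k = bin(m+1):bin(m+2)                % falling slope of filter m
            H(m, k) = (bin(m+2) - k) / (bin(m+2) - bin(m+1));
        end
    end
    melSpectrum = H * P;                         % apply the bank to the periodogram columns P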


4.2.6 CEPSTRUM
The name cepstrum comes from spectrum, obtained by reversing the first four letters of spectrum. By definition the cepstrum is the result of taking the inverse Fourier transform of the logarithm of the estimated spectrum of a signal. There are different types of cepstrum, among them the complex cepstrum, real cepstrum, power cepstrum and phase cepstrum.

The real cepstrum uses the logarithm of the magnitude values; it uses only the magnitude information of the spectrum.
The complex cepstrum uses the complex logarithm, so it keeps the information about both the magnitude and the phase of the initial spectrum.
The complex cepstrum can be expressed as: cepstrum of signal = FT(log(FT(signal)) + j*2*pi*m), where m is the integer required to properly unwrap the angle (phase).
Algorithmically: signal -> FT -> log -> phase unwrapping -> FT -> cepstrum.

The MFCC can be calculated as follows:

c_n = sum_{k=1}^{K} (log S_k) cos[ n (k - 1/2) pi / K ],   n = 1, 2, ..., K,

where S_k are the mel spectrum coefficients and K is the number of mel filters.
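A minimal MATLAB sketch of this step (assuming melSpectrum holds the K = 20 mel filter bank outputs, with one frame per column, and that 12 coefficients are kept) might be:

    K = 20; nc = 12;                                % assumed sizes
    logMel = log(melSpectrum(:, 1));                % log energies of one frame (first column)
    c = zeros(nc, 1);
    for n = 1:nc
        c(n) = sum(logMel.' .* cos(n*((1:K) - 0.5)*pi/K));   % cepstral coefficient c_n
    end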

The pipeline from signal to cepstrum is as follows: signal -> Fourier transform -> spectrum -> logarithm -> discrete Fourier transform -> cepstrum.

Figure 16: Diagram showing the pipeline from signal to cepstrum.

4.2.7 LINEAR PREDICTIVE ANALYSIS

Linear predictive analysis is a method that estimates the power spectrum of a signal. It is widely used as a formant estimation technique. Digital speech signals are compressed for efficient transmission, storage and optimum utilization of channels on wireless media, and this is achieved using LPC.

Figure 17: Speech analysis filter (speech signal -> speech analysis filter -> residual error).


The residual error and the speech parameters are then transferred to regenerate the original signal. A parametric model is computed based on least-mean-square error theory, and this is referred to as Linear Prediction (LP). The obtained LPC coefficients describe the formants. Formant frequencies are the frequencies at which resonant peaks occur. Thus, by applying the LPC method, the positions of the formants in a speech signal are evaluated by calculating the linear predictive coefficients over a sliding window and finding the peaks in the spectrum of the resulting LP filter.

4.2.8 Linear Predictive Coding Coefficients (LPCC)

Linear Predictive Coding Coefficients (LPCC) are based on the auto-correlation method. LPCC is an older approach which works at a low bit rate and was one of the early attempts to model human speech. The auto-correlation technique finds the correlation between the signal and itself by auto-correlating each frame of the windowed signal. The following equation can be used:

r(k) = sum_{n=k}^{Nw-1} sw(n) sw(n - k),

where
Nw is the length of the window,
sw is the windowed segment.
The diagram below illustrates the process of LPCC.


Figure 18: Block diagram of the LPCC system (input speech -> pre-emphasis -> frame blocking -> Hamming window -> auto-correlation analysis -> LPC analysis -> cepstrum analysis).
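As a rough MATLAB sketch of the auto-correlation and LPC analysis stages shown above (the frame variable sw and the LPC order p = 12 are assumptions, not the project's exact code):

    p  = 12;                                 % assumed LPC order
    r  = xcorr(sw, sw, p);                   % autocorrelation at lags -p..p
    r  = r(p+1:end);                         % keep lags 0..p
    a  = levinson(r, p);                     % prediction polynomial [1 a1 ... ap]
    % (lpc(sw, p) performs the same autocorrelation-based computation in one call;
    %  a subsequent cepstrum analysis stage converts these coefficients to LPCCs)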

4.2.8.1 Advantages of LPCC
Resources required are low.
It is easy to implement.
It is very popular.
4.2.8.2 Disadvantages of LPCC
Number of speakers: single speaker only.
Number of languages: single language.
Vocabulary size: small to moderate, below the three-hundred-word mark.
Not accurate for long sentences or words.


4.2.9 PERCEPTUAL LINEAR PREDICTIVE (PLP)

Perceptual Linear Prediction was developed by Hermansky. PLP is a method that models human speech based on the concept of the psychophysics of hearing. PLP works in the same way as LPC except that its spectral characteristics are matched to the human auditory system.

Figure 19: PLP processing stages (critical band analysis -> equal loudness pre-emphasis -> intensity-loudness conversion).
The main perceptual aspects approximated by PLP are the critical-band resolution curve, the equal-loudness curve and the intensity-loudness power law.
The Bark frequency corresponding to an angular frequency w is given by

Omega(w) = 6 ln[ w/(1200*pi) + ((w/(1200*pi))^2 + 1)^0.5 ].

This is achieved by applying a frequency warping onto the Bark scale, the first step being the conversion from frequency to Bark, which better represents the resolution of human hearing in frequency. The power spectrum is then convolved with the simulated critical-band masking curve to give the auditory warped spectrum; this simulates the critical-band integration of human hearing. The Bark filter bank thus integrates frequency warping, smoothing and sampling. After the Bark filter comes the equal-loudness pre-emphasis weighting, which weights the filter bank outputs to simulate the sensitivity of hearing. The equalised values are then transformed by the intensity-loudness power law, achieved by raising each value to the power 0.33. The resulting auditory warped spectrum is further processed by linear prediction (LP).


4.2 Relative spectral filtering of log domain coefficients (RASTA PLP)

Figure 20: Block diagram of RASTA PLP (input speech -> pre-emphasis -> framing -> windowing -> DFT -> critical Bark analysis -> log -> RASTA filtering -> equal loudness -> inverse log -> intensity-loudness power law -> auto-regressive modelling -> cepstral domain transform -> RASTA PLP features).


4.2.9.0 Advantages of RASTA PLP

Number of speakers: multi-speaker.
Number of languages: multiple languages.
Vocabulary size: moderate to large.

4.2.9.1 Disadvantages of RASTA PLP

Resources required: moderate to high.
Popularity: moderate to low.
Ease of implementation: moderate to hard.

4.3.0 Mel Frequency Cepstrum Coefficients (MFCC)

This is the most widely used method for extracting spectral features. It is based on the frequency domain using the mel scale, which reflects the human ear's perception of frequency.
In short, we first derive the real cepstrum of a windowed short-time signal from the fast Fourier transform of that signal. MFCC differs from the real cepstrum in that it uses a nonlinear (mel) frequency scale, which approximates the behaviour of the auditory system. Being an audio feature extraction technique, MFCC extracts parameters from the speech that are similar to the ones used by humans for hearing, while de-emphasizing all other information. MFCC works by first dividing the speech into time frames consisting of a fixed number of samples. Each time frame is then windowed with a Hamming window to eliminate discontinuities at the edges.
The filter coefficients w(n) of a Hamming window of length N are given by

w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)),   0 <= n <= N-1,

where N is the total number of samples and n is the current sample.
After windowing, we take the fast Fourier transform of each frame to extract the frequency components of the signal; the FFT speeds up this computation.


Figure 21: MFCC derivation (speech signal -> pre-emphasis -> framing -> windowing -> FFT -> mel filter bank -> log -> DCT -> mel cepstrum).

4.3.1.0 Advantages of MFCC
Resources required are low to moderate.
MFCC approximates the human auditory response closely, as a result of positioning the frequency bands logarithmically.
Popularity is high.
Ease of implementation: easy to moderate.
Number of speakers: multi-speaker.
Vocabulary size: moderate to large.

4.3.0.0 Disadvantages of MFCC
MFCC values are not very robust in the presence of additive noise; hence it is common to normalise their values in speech recognition systems to reduce the influence of noise.

4.3.1 DISCRETE WAVELET TRANSFORM (DWT)
Speech is a non-stationary signal: it has short high-frequency bursts and long quasi-stationary segments. As a consequence, the Fourier transform is not the most suitable method for its analysis, since it only provides the frequency components of a signal and gives no information about the time at which each frequency is present. This leaves the discrete wavelet transform (DWT) as the most appropriate tool, because it provides a flexible time-frequency window and deals much better with non-stationary signals.
The wavelet transform decomposes signals over translated and dilated mother wavelets. Mother wavelets are time functions with fast decay and finite energy. The different versions of the single wavelet are orthogonal to each other.
The continuous wavelet transform is given by

CWT(a, b) = (1/sqrt(a)) * integral of x(t) psi*((t - b)/a) dt,

where psi(t) is the mother wavelet, a is the scaling factor and b is the translation parameter. Choosing a = 2^j and b = 2^j k gives the DWT.
The DWT needs two sets of functions: the wavelet functions

psi(t) = sum_{n=0}^{N-1} g[n] sqrt(2) phi(2t - n),

and the scaling functions

phi(t) = sum_{n=0}^{N-1} h[n] sqrt(2) phi(2t - n),

where
phi(t) is the scaling function,
h[n] is the impulse response of a low-pass filter,
g[n] is the impulse response of a high-pass filter.
The wavelet and scaling functions are implemented using the pair of filters h[n] and g[n]. The two filters are quadrature mirror filters and satisfy the property g[n] = (-1)^(n-1) h(1 - n). The input signal is low-pass filtered to give the approximation components and high-pass filtered to give the detail components of the input speech.
Next is dyadic decomposition: the approximation signal is again low-pass and high-pass filtered to produce the next level of approximation and detail components. Dyadic decomposition separates the input signal bandwidth into a logarithmic set of bandwidths, whereas a uniform decomposition would divide it into sets of uniform bandwidth.
The DWT is good at resolving frequencies in the way speech requires: high frequencies appear briefly at the beginning of a sound, while lower frequencies are present afterwards for quite a long period. The discrete wavelet parameters carry information about different frequency scales, which is useful as it provides speech information for the corresponding frequency band.


Figure 22: Diagram showing how DWLP features are obtained (speech signal -> pre-processing, framing and windowing -> LPC applied to each sub-band -> concatenation -> DWLP).


Advantages of DWT
1. There is localisation simultaneously in both the time and frequency domains.
2. DWT can give better recognition accuracy than MFCC and LPC.
3. It has better energy compaction, hence it has been used for feature extraction in place of mel-filtered sub-band energies.
4. DWT has better time resolution than the Fourier transform.
5. A wavelet transform can be used to break a signal down into component wavelets.
6. It is computationally very fast.
7. DWT can model the details of unvoiced sound portions effectively.
8. The wavelet transform is able to separate the fine details in a signal.

Disadvantages of DWT
The continuous wavelet transform contains high redundancy when analysing signals.
The cost of computing the DWT may be higher than that needed for the DCT and other methods.
It needs a longer compression time.

SUMMARY

Through implementing the above procedures, a set of mel frequency cepstrum coefficients (MFCC) is computed for each frame.
Each input speech utterance is thus converted into a sequence of acoustic vectors.


CHAPTER 5


5.0 INTRODUCTION
Spectral analysis refers to techniques for estimating the power of the frequency components of a signal. Many naturally occurring phenomena are oscillatory in nature and have frequency dependence, for example speech signals and weather patterns.
Speaker recognition is made simpler by the branch of engineering which deals with pattern recognition. The main purpose of pattern recognition is to intelligently assign objects of interest to a number of classes. From the input speech we extract the acoustic vectors, which form the patterns (objects); in this case the classes of interest are the speakers, and the classification procedure applied to the extracted features is also termed feature matching.
In addition, if we have a set of already known individual classes, then the problem reduces to supervised pattern recognition. These known patterns make up the training set, from which the classification algorithm is derived. The remaining patterns comprise the test set.
This chapter looks at the methods that can be used to carry out feature matching. These methods are the MFCC approach, the FFT approach, vector quantization, DTW, GMM and HMM. All these methods will be explained and a final decision reached as to which approach to use, based on the advantages and disadvantages of each.

5.2 FEATURE MATCHING

The following methods are used for feature matching:

Vector Quantization (VQ)
Fast Fourier Transform (FFT)
Dynamic Time Warping (DTW)
Gaussian Mixture Models (GMM)
Hidden Markov Models (HMM)

5.2.1 VECTOR QUANTIZATION

The branch of electrical engineering responsible for minimising the bits necessary to transmit signals while maintaining acceptable signal fidelity is called data compression, or in our case speech compression or speech coding. The conversion of an analogue signal (continuous amplitude, continuous time) into a digital signal (discrete amplitude, discrete time) is made up of two parts, namely sampling and quantization. Sampling converts a continuous-time signal by measuring the signal value at regular time intervals, and quantization converts a continuous amplitude into one of a set of discrete amplitudes.
Before describing vector quantization there is a need to understand what scalar quantization is. Scalar quantization is when each of a sequence of signal values is quantized separately, whilst vector quantization (block quantization) is when the sequence of signal values is quantized jointly as a single vector; this process is popularly known as VQ. VQ involves mapping vectors from the large set of feature vectors of a particular user to a finite set of feature vectors that represent the centroids of the distribution. Each region is called a cluster and can be represented by a codeword, which is its centre, and a collection of codewords makes up a codebook. Vector quantization is used because it is impossible to represent every single feature vector in the space that is generated from the training utterances of the corresponding speaker.

In the figure below, two speakers and two dimensions are represented. The circles show the acoustic vectors of speaker one and the triangles those of speaker two. During the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords are shown as black circles and black triangles for speaker one and speaker two respectively. The main principle is then to vector-quantize an input utterance using each trained codebook; this makes up the recognition phase. To identify a speaker from an input utterance, the total VQ distortion is calculated against each codebook, and the speaker whose codebook gives the smallest total distortion is the one identified.
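As a rough MATLAB sketch of this matching rule (the variable names testVecs and codebooks and the use of squared Euclidean distance are assumptions for illustration, not the project's exact code):

    % testVecs: D-by-T matrix of MFCC vectors from the test utterance
    % codebooks{s}: D-by-M codebook of speaker s, built during training
    for s = 1:numSpeakers
        dist = 0;
        for t = 1:size(testVecs, 2)
            diffs = codebooks{s} - repmat(testVecs(:, t), 1, size(codebooks{s}, 2));
            dist  = dist + min(sum(diffs.^2, 1));     % distortion to the nearest codeword
        end
        totalDistortion(s) = dist;
    end
    [~, identifiedSpeaker] = min(totalDistortion);    % smallest total distortion wins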

Figure 23 Conceptual diagram illustrating vector quantization codebook formation.


One speaker can be discriminated from another based on the locations of the centroids.
(Adapted from Song et al., 1987)
The major advantages of VQ are:

Discrete representation of speech signals.
It needs less storage space.
Reduced computation. In speech recognition, a major computation is the determination of spectral similarity between a pair of vectors. With the VQ representation this is often reduced to a table lookup of similarities between pairs of codebook vectors.

5.2.1.1 OPTIMIZATION USING LBG ALGORITHM

First comes the enrolment session, then feature extraction. The feature vectors extracted from the input speech of each speaker give a set of training vectors for that speaker. The next stage is to build a speaker-specific VQ codebook for each speaker using these training vectors. This is done with the LBG algorithm [Linde, Buzo and Gray, 1980], which clusters a set of L training vectors into M codebook vectors.
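A minimal MATLAB sketch of the LBG split-and-refine loop is shown below (the split factor of 0.01, the ten Lloyd iterations and the variable names are assumptions, not the project's exact implementation):

    function codebook = lbg(X, M)
    % X: D-by-L matrix of training vectors; M: desired codebook size (a power of two).
    epsSplit = 0.01;
    codebook = mean(X, 2);                                         % start from a single centroid
    while size(codebook, 2) < M
        codebook = [codebook*(1+epsSplit), codebook*(1-epsSplit)]; % split every centroid
        for iter = 1:10                                            % Lloyd refinement iterations
            C  = size(codebook, 2);
            d2 = zeros(C, size(X, 2));
            for c = 1:C                                            % squared distance to each centroid
                diffs = X - repmat(codebook(:, c), 1, size(X, 2));
                d2(c, :) = sum(diffs.^2, 1);
            end
            [~, idx] = min(d2, [], 1);                             % nearest-centroid assignment
            for c = 1:C
                if any(idx == c)
                    codebook(:, c) = mean(X(:, idx == c), 2);      % move centroid to cluster mean
                end
            end
        end
    end
    end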

5.3 FAST FOURIER TRANSFORM

We start with a time-domain signal, where the amplitude of the signal varies with time. The MATLAB function fft (Fast Fourier Transform) is used to convert the signal from the time domain to the frequency domain; it is a fast algorithm for computing the discrete Fourier transform. From the output of fft we obtain the magnitude response and the phase response. The magnitude tells us the strength of each frequency component, whilst the phase tells us how the frequency components align in time.

Figure 5.3


Figure 24

5.3.1 Challenges of the Fast Fourier Transform

It produces complex-valued output from real-valued input, which is not trivial to deal with.
With the FFT we need to compute both the magnitude and the phase information.
We also have to deal with both the negative and the positive frequency components.
Signal power information is not directly available.

5.4 Dynamic Time Warping

The Euclidean distance is quite often used for tasks where one sequence has to be compared with another in a time series, since time series are the most frequent form of data presentation in many fields of science. A challenge arises when the two sequences have nearly the same overall shape but do not line up along the x-axis: any distance that aligns the i-th point of one time series with the i-th point of another will fail. An elastic (non-linear) method produces a more meaningful similarity measure and allows similar shapes to match even if they are out of phase in the time domain. Dynamic Time Warping (DTW) is an algorithm that exploits this non-linear alignment to warp one time series with respect to another; it finds an optimal alignment between two given time-dependent sequences under certain restrictions.

Take two time series X and Y of length n and m respectively:

X = x_1, x_2, ..., x_n,
Y = y_1, y_2, ..., y_m.

To align the two sequences, DTW constructs an n-by-m matrix where the (i, j) element contains the distance d(x_i, y_j) between the points x_i and y_j. The Euclidean distance used is d(x_i, y_j) = (x_i - y_j)^2. Each matrix element (i, j) corresponds to the alignment between the points x_i and y_j. A warping path W is a contiguous set of matrix elements which defines a mapping between X and Y:

W = w_1, w_2, ..., w_k, ..., w_K,   max(m, n) <= K < m + n - 1.

W is subject to several constraints:

1. Continuity: given w_k = (a, b), then w_(k-1) = (a', b') where a - a' <= 1 and b - b' <= 1. This limits the permitted steps in the warping path to adjacent cells.
2. Boundary conditions: w_1 = (1, 1) and w_K = (m, n). This requires the warping path to start and finish in diagonally opposite corners.
3. Monotonicity: given w_k = (a, b) and w_(k-1) = (a', b'), then a - a' >= 0 and b - b' >= 0. This forces the points in W to be monotonically ordered in time.


The dynamic time warping distance is then given by:

DTW(X, Y) = min{ (sum_{k=1}^{K} w_k)^(1/2) / K },

where the K in the denominator compensates for the fact that warping paths may have different lengths. This optimal path is found efficiently using dynamic programming.

Dynamic Programming
Dynamic programming determines the minimum-cost path, providing the best warp between a given speaker template and a test utterance. A cumulative distance matrix is generated from the Euclidean distances already obtained.
The process of generating the cumulative matrix is as follows (a MATLAB sketch of this recursion is given after the list):
1. Use the local path constellation in which P1, P2 and P3 denote the three permitted predecessor paths; the best path is the one with the least cumulative Euclidean distance.
2. Take the initial condition g(1,1) = d(1,1), where d and g are the Euclidean distance matrix and the cumulative distance matrix respectively.
3. Calculate the first column g(i,1) = g(i-1,1) + d(i,1).
4. Compute the first row g(1,j) = g(1,j-1) + d(1,j).
5. Proceed to the second row g(i,2) = min(g(i,1), g(i-1,1), g(i-1,2)) + d(i,2).
6. Continue from left to right and from bottom to top for the rest of the grid: g(i,j) = min(g(i,j-1), g(i-1,j-1), g(i-1,j)) + d(i,j).
7. Read off the value of g(n,m).
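A minimal MATLAB sketch of this cumulative-distance recursion (assuming d is the n-by-m matrix of Euclidean distances between the frames of the test and reference utterances) might be:

    [n, m] = size(d);
    g = zeros(n, m);
    g(1,1) = d(1,1);
    for i = 2:n, g(i,1) = g(i-1,1) + d(i,1); end          % first column
    for j = 2:m, g(1,j) = g(1,j-1) + d(1,j); end          % first row
    for i = 2:n
        for j = 2:m
            g(i,j) = min([g(i,j-1), g(i-1,j-1), g(i-1,j)]) + d(i,j);
        end
    end
    dtwDistance = g(n,m);                                 % optionally normalised by path length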

5.4.1 ADVANTAGES OF DTW

It is simple.
It is efficient.
It is fast.
It requires simple and easy hardware.

5.4.1.1 DISADVANTAGES OF DTW

It does not take into account the vocal tract information of a particular user.

5.4.1.2 Speaker identification
Consider a set of reference speakers with known feature vectors. When asked to identify an unknown speaker, assuming that the speaker is one whose voice sample we already have, the first step is feature extraction. Once we obtain the feature vectors we warp the unknown vectors with respect to each reference speaker in turn, follow the dynamic programming procedure and calculate g(n,m). This is repeated for all available speakers, and the smallest value of g(n,m) gives the identity.

5.5 HIDDEN MARKOV MODELS (HMM)

Hidden Markov models are among the models most widely used in speech recognition; both the baseline recogniser and sub-band recognisers can be implemented with HMMs. In an HMM we are mainly interested in finding the probability of a sequence of feature vectors, which is computed using the transition probabilities between states and the observation probabilities of the feature vectors in a given state.
In finding the probability of speech feature vectors generated from an HMM there are three problems which must be solved. The first problem is evaluation: finding the probability that a sequence of visible states was generated by the model M. The second is decoding: finding the state sequence that maximises the probability of the observed sequence. The last is training, which adjusts the model parameters to maximise the probability of the observed sequence; in this step we determine the reference speaker models for all speakers.




Figure 25

5.5.1.0

An HMM describes a two-stage stochastic process. A Markov chain forms the first stage, and in the second stage, for every point in time t, an output (observation) is emitted. The following elements describe an HMM with discrete symbols:
N, the number of hidden states in the model. We label the states as {1, 2, ..., N} and denote the state at time t by q_t.
M, the number of distinct observation symbols per state, V = {v_1, v_2, ..., v_M}.
The state transition probability distribution A = {a_ij}, where a_ij = P[q_(t+1) = j | q_t = i].
The observation symbol probability distribution in state j, B = {b_j(k)}, where b_j(k) = P[o_t = v_k | q_t = j], 1 <= k <= M.
The initial state distribution pi = {pi_i}, where pi_i = P[q_1 = i], 1 <= i <= N.
Training the HMMs
Each speaker s in the database must have a corresponding HMM lambda_s, where the model parameters lambda = (A, B, pi) maximise the likelihood of the training data.

The following diagram illustrates the steps for estimating the model parameters.


Figure 26: Diagram showing the HMM training structure (speech signal -> feature extraction (MFCC) -> vector quantizer -> observation sequence -> forward-backward algorithm -> HMM lambda = (A, B, pi)).


As described before, the input speech is converted into MFCC vectors, and these feature vectors are quantized into observation sequences. Finally, the model parameters are estimated from the observation sequence using the forward-backward (Baum-Welch) algorithm; a minimal MATLAB sketch of this step is shown below.
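The sketch below uses MATLAB's discrete-emission HMM routines from the Statistics and Machine Learning Toolbox; the state count N = 5, codebook size M = 64 and variable names are assumptions, and MATLAB's routines additionally assume the chain starts in state 1:

    N = 5;  M = 64;                                    % assumed number of states and symbols
    A0 = ones(N, N)/N;                                 % flat initial guess for the transition matrix
    B0 = ones(N, M)/M;                                 % flat initial guess for the emission matrix
    % obsSeq is a 1-by-T vector of codebook indices (1..M) from the vector quantizer
    [A, B] = hmmtrain(obsSeq, A0, B0, 'Maxiterations', 200);   % Baum-Welch (forward-backward)
    [~, logLik] = hmmdecode(testSeq, A, B);            % log-likelihood of a test sequence
    % repeating this for each speaker's model and taking the largest logLik identifies the speaker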

GAUSSIAN MIXTURE MODELS.


Figure 26: Gaussian mixture model

5.6 INTRODUCTION
The Gaussian Mixture Model (GMM) is a parametric, statistical method for speaker recognition. Feature vectors in d dimensions, after clustering, resemble Gaussian distributions. This implies that each cluster can be viewed as a probability distribution, and the features belonging to the clusters can be best represented by their probability values. Speech analysis (feature extraction) is done first; then the Gaussian mixture speaker model and its parameterisation are analysed. The use of Gaussian mixture densities in speaker recognition is motivated by two facts:

The individual component Gaussian densities represent a collection of acoustic classes; these acoustic classes in turn represent vocal tract configurations.
A Gaussian mixture density gives a smooth approximation to the underlying long-term sample distribution derived from a speaker's utterances.

The following diagram shows the Gaussian model.


Figure 27: Gaussian mixture model

5.7 Speech analysis
Linear predictive cepstral coefficients and reflection coefficients have historically been employed most frequently in speaker recognition systems. However, these degrade noticeably as a result of the effects of noise. Recent studies have pointed out that directly computed filter bank features are more robust for noisy speech recognition. So the magnitude spectrum of a quasi-stationary frame is first pre-emphasised and processed by a mel-scale filter bank; the filter energy outputs are then cosine transformed to give cepstral coefficients.

5.8 MODEL DESCRIPTION
A Gaussian mixture density is given by

p(x | lambda) = sum_{i=1}^{M} p_i b_i(x),

where
x is a D-dimensional feature vector,
p_i is the mixture weight of the ith component,
b_i(x) is the probability density of the ith component of the feature space.

Each component density is a D-variate Gaussian of the form

b_i(x) = 1 / ((2*pi)^(D/2) |Sigma_i|^(1/2)) * exp{ -1/2 (x - mu_i)' Sigma_i^(-1) (x - mu_i) },

where mu_i is the mean vector of the ith component and Sigma_i is its covariance matrix.

The mixture weights satisfy the constraint sum_{i=1}^{M} p_i = 1.

We represent the Gaussian mixture density by the mixture weights, mean vectors and covariance matrices of all the component densities. The parameters are collected as

lambda = { p_i, mu_i, Sigma_i },   i = 1, ..., M.

To identify a speaker, each speaker is represented by a GMM and is referred to by his/her model lambda.

This information can be shown in a diagram as follows:

MAXIMUM LIKELIHOOD PARAMETER ESTIMATION

The obtained feature vectors have to be classified into the different Gaussian components. At the start we do not know the means and covariances of the components present, so we cannot classify the vectors correctly.
To maximise the likelihood of the classification for a given set of feature vectors, an algorithm known as Expectation Maximisation (EM) is used.
The algorithm works as follows:
1. Assume initial values of the means mu_i, covariances Sigma_i and mixture weights p_i.
2. Iteratively recalculate the means, covariances and mixture weights using the formulae below, so that the likelihood of the set of T feature vectors is maximised.
Formulae:


Mean:

mu_i = ( sum_{t=1}^{T} p(i | x_t, lambda) x_t ) / ( sum_{t=1}^{T} p(i | x_t, lambda) )

Mixture weights:

p_i = (1/T) sum_{t=1}^{T} p(i | x_t, lambda)

Variances:

sigma_i^2 = ( sum_{t=1}^{T} p(i | x_t, lambda) x_t^2 ) / ( sum_{t=1}^{T} p(i | x_t, lambda) ) - mu_i^2

where p(i | x_t, lambda) is referred to as the a posteriori probability and is given by the expression:

p(i | x_t, lambda) = p_i b_i(x_t) / ( sum_{k=1}^{M} p_k b_k(x_t) )

5.9 SPEAKER IDENTIFICATION
After modelling each user's Gaussian mixture, a set of speaker models is obtained; each individual model represents the Gaussian distribution components present for that speaker.
Given S speakers, the set of models is represented by {lambda_1, lambda_2, lambda_3, ..., lambda_S}.
For a given test utterance X we finally choose the speaker model with the maximum a posteriori probability. Mathematically,

S_hat = arg max_{1<=k<=S} Pr(lambda_k | X) = arg max_{1<=k<=S} p(X | lambda_k) Pr(lambda_k) / p(X).
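A minimal MATLAB sketch of GMM training and identification using the Statistics and Machine Learning Toolbox is given below; the 16 mixture components, diagonal covariances and variable names are assumptions, not the project's exact configuration:

    % trainFeats{s} and testFeats hold one MFCC vector per row
    M = 16;                                              % assumed number of mixture components
    for s = 1:numSpeakers                                % train one GMM per speaker (EM inside fitgmdist)
        gmm{s} = fitgmdist(trainFeats{s}, M, ...
                           'CovarianceType', 'diagonal', 'RegularizationValue', 1e-4);
    end
    for s = 1:numSpeakers                                % score the test utterance against each model
        logLik(s) = sum(log(pdf(gmm{s}, testFeats)));    % sum of per-frame log p(x_t | lambda_s)
    end
    [~, identifiedSpeaker] = max(logLik);                % maximum-likelihood decision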

5.9.1 ADVANTAGES OF GMM

It achieves very high identification accuracy.


5.9.2 DISADVANTAGES OF GMM

Accuracy degrades somewhat when the number of components used becomes high.


CHAPTER 6
6.0 RESULTS
In this project feature extraction was done with the Mel Frequency Cepstral Coefficients
(MFCC). Each speaker was then modelled using Vector Quantization (VQ). A vector quantizer
codebook was generated by clustering the feature vectors of each speaker and stored in the
speaker database. The LBG algorithm was used to do the clustering, and the VQ-based clustering
was found to provide a faster speaker identification process than the other methods considered.
In this project eight speech signals corresponding to eight speakers were stored in the train
folder. The eight signals are s1.wav, s2.wav, s3.wav, s4.wav, s5.wav, s6.wav, s7.wav and
s8.wav. These were compared with the sound files in the test folder, and the following results
were obtained.
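For reference, the training and testing routines listed in section 6.4 would typically be invoked as shown below; the folder names are illustrative.

code = train('train\', 8);     % build a VQ codebook for s1.wav ... s8.wav in the train folder
test('test\', 8, code);        % match each utterance in the test folder to a codebook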

6.1 WHEN ALL VALID SPEAKERS ARE CONSIDERED


Speaker 1 matches with speaker 1
Speaker 2 matches with speaker 2
Speaker 3 matches with speaker 3
Speaker 4 matches with speaker 4
Speaker 5 matches with speaker 5
Speaker 6 matches with speaker 6
Speaker 7 matches with speaker 7
Speaker 8 matches with speaker 8

6.2 WHEN THERE IS AN IMPOSTER IN PLACE OF SPEAKER 4


Speaker 1 matches with speaker 1
Speaker 2 matches with speaker 2
Speaker 3 matches with speaker 3
Speaker 4 is an imposter and corresponding distance is 1.060407e+001
Speaker 5 matches with speaker 5
Speaker 6 matches with speaker 6
Speaker 7 matches with speaker 7
Speaker 8 matches with speaker 8

6.3 EUCLIDEAN DISTANCES BETWEEN THE CODEBOOKS OF SPEAKERS


Distance between speaker 1 and speaker 1 is 2.582456e+000
Distance between speaker 1 and speaker 2 is 3.423658e+000
Distance between speaker 1 and speaker 3 is 6.691428e+000
Distance between speaker 1 and speaker 4 is 3.290923e+000
Distance between speaker 1 and speaker 5 is 7.227603e+000
Distance between speaker 1 and speaker 6 is 6.004165e+000
Distance between speaker 1 and speaker 7 is 6.388921e+000
Distance between speaker 1 and speaker 8 is 3.990130e+000
Distance between speaker 2 and speaker 1 is 3.644154e+000
Distance between speaker 2 and speaker 2 is 2.023527e+000
Distance between speaker 2 and speaker 3 is 5.932640e+000


Distance between speaker 2 and speaker 4 is 3.962964e+000
Distance between speaker 2 and speaker 5 is 6.041227e+000
Distance between speaker 2 and speaker 6 is 5.033079e+000
Distance between speaker 2 and speaker 7 is 5.120361e+000
Distance between speaker 2 and speaker 8 is 4.053674e+000
Distance between speaker 3 and speaker 1 is 6.208796e+000
Distance between speaker 3 and speaker 2 is 5.631654e+000
Distance between speaker 3 and speaker 3 is 2.000804e+000
Distance between speaker 3 and speaker 4 is 5.191537e+000
Distance between speaker 3 and speaker 5 is 3.464318e+000
Distance between speaker 3 and speaker 6 is 3.608015e+000
Distance between speaker 3 and speaker 7 is 4.014857e+000
Distance between speaker 3 and speaker 8 is 4.323667e+000
Distance between speaker 4 and speaker 1 is 3.280098e+000
Distance between speaker 4 and speaker 2 is 3.713952e+000
Distance between speaker 4 and speaker 3 is 5.298161e+000
Distance between speaker 4 and speaker 4 is 2.499871e+000
Distance between speaker 4 and speaker 5 is 5.865334e+000
Distance between speaker 4 and speaker 6 is 4.805346e+000
Distance between speaker 4 and speaker 7 is 5.314957e+000
Distance between speaker 4 and speaker 8 is 3.441053e+000
Distance between speaker 5 and speaker 1 is 6.978178e+000
Distance between speaker 5 and speaker 2 is 6.136129e+000
Distance between speaker 5 and speaker 3 is 3.579665e+000
Distance between speaker 5 and speaker 4 is 5.900074e+000
Distance between speaker 5 and speaker 5 is 2.078398e+000
Distance between speaker 5 and speaker 6 is 3.537214e+000
Distance between speaker 5 and speaker 7 is 3.579846e+000
Distance between speaker 5 and speaker 8 is 5.079328e+000
Distance between speaker 6 and speaker 1 is 5.776238e+000
Distance between speaker 6 and speaker 2 is 5.380254e+000
Distance between speaker 6 and speaker 3 is 3.690566e+000
Distance between speaker 6 and speaker 4 is 4.937416e+000
Distance between speaker 6 and speaker 5 is 3.420030e+000
Distance between speaker 6 and speaker 6 is 2.990975e+000
Distance between speaker 6 and speaker 7 is 3.429637e+000
Distance between speaker 6 and speaker 8 is 4.257454e+000
Distance between speaker 7 and speaker 1 is 5.701679e+000
Distance between speaker 7 and speaker 2 is 4.847096e+000
Distance between speaker 7 and speaker 3 is 4.125191e+000
Distance between speaker 7 and speaker 4 is 5.077611e+000
Distance between speaker 7 and speaker 5 is 3.453576e+000
Distance between speaker 7 and speaker 6 is 2.844346e+000
Distance between speaker 7 and speaker 7 is 1.706408e+000


Distance between speaker 7 and speaker 8 is 4.496219e+000
Distance between speaker 8 and speaker 1 is 3.791221e+000
Distance between speaker 8 and speaker 2 is 3.783202e+000
Distance between speaker 8 and speaker 3 is 3.965679e+000
Distance between speaker 8 and speaker 4 is 3.363752e+000
Distance between speaker 8 and speaker 5 is 4.687492e+000
Distance between speaker 8 and speaker 6 is 3.931799e+000
Distance between speaker 8 and speaker 7 is 4.254214e+000
Distance between speaker 8 and speaker 8 is 2.386353e+000
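For reference, a table of this kind can be produced from the trained codebooks roughly as follows; this is a sketch rather than the exact code used, and the averaging convention is an assumption.

% code{i} are the codebooks returned by train.m; disteu.m is listed in section 6.4
for i = 1:8
    for j = 1:8
        d = disteu(code{i}, code{j});                   % distances between codewords
        avg = sum(min(d, [], 2)) / size(d, 1);          % average nearest-codeword distance
        fprintf('Distance between speaker %d and speaker %d is %e\n', i, j, avg);
    end
end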

Figure 28 Plot of the differences in Euclidean distance between the speakers.


Figure 29 Plot of the Euclidean distance between speaker 1 and all the speakers.


Figure 30 Plot of the Euclidean distance between speaker 2 and all the speakers.


Figure 31 Plot of the Euclidean distance between speaker 3 and all the speakers.

Figure 32 Plot of the Euclidean distance between speaker 4 and all the speakers.


Figure 33 Plot of the Euclidean distance between speaker 5 and all the speakers.


Figure 34 Plot of the Euclidean distance between speaker 7 and all the speakers.

Figure 35 Plot of the Euclidean distance between speaker 8 and all the speakers.


6.4 CODE FOR RECORDING VOICE AND ANALYSIS


a = audiorecorder(8000, 8, 1);    % 8 kHz, 8-bit, single-channel recorder
recordblocking(a, 5);             % record for 5 seconds
b = getaudiodata(a);              % retrieve the recorded samples as a vector
plot(b);                          % time-domain waveform
sound(b, 8000);                   % play back the recording
c = fft(b);
plot(abs(c));                     % magnitude spectrum of the recording
VQ approach:
train.m
This is the code for training the system (it is the main function).
function code = train(traindir, n)
k = 16;                                       % number of centroids required
for i = 1:n                                   % train a VQ codebook for each speaker
    file = sprintf('%ss%d.wav', traindir, i); % training files are named s1.wav ... sn.wav
    disp(file);
    [s, fs] = wavread(file);
    v = mfcc(s, fs);                          % compute MFCCs
    code{i} = vqlbg(v, k);                    % train the VQ codebook for speaker i
end
mfcc.m (subroutine which will be called from the main function)
function r = mfcc(s, fs)
m = 100;                                      % frame shift in samples
n = 256;                                      % frame length in samples
l = length(s);
nbFrame = floor((l - n) / m) + 1;             % number of frames in the signal
for i = 1:n
    for j = 1:nbFrame
        M(i, j) = s(((j - 1) * m) + i);       % split the signal into overlapping frames
    end
end
h = hamming(n);
M2 = diag(h) * M;                             % apply a Hamming window to each frame
for i = 1:nbFrame
    frame(:, i) = fft(M2(:, i));              % spectrum of each frame
end
m = melfb(20, n, fs);                         % mel-spaced filter bank with 20 filters
n2 = 1 + floor(n / 2);
z = m * abs(frame(1:n2, :)).^2;               % filter-bank energies
r = dct(log(z));                              % cosine transform of log energies gives the MFCCs
disteu.m (subroutine which will be called from the main function)
function d = disteu(x, y)
[M, N]  = size(x);                            % columns of x and y are feature vectors
[M2, P] = size(y);
if (M ~= M2)
    error('Matrix dimensions do not match.');
end
d = zeros(N, P);
if (N < P)
    copies = zeros(1, P);
    for n = 1:N
        d(n, :) = sum((x(:, n+copies) - y) .^ 2, 1);
    end
else
    copies = zeros(1, N);
    for p = 1:P
        d(:, p) = sum((x - y(:, p+copies)) .^ 2, 1)';
    end
end
d = d .^ 0.5;                                 % Euclidean distance between every pair of columns
melfb.m (subroutine called from the main function)
function m = melfb(p, n, fs)
f0  = 700 / fs;
fn2 = floor(n / 2);
lr  = log(1 + 0.5 / f0) / (p + 1);
% convert to FFT bin numbers with 0 for the DC term
bl = n * (f0 * (exp([0 1 p p+1] * lr) - 1));
b1 = floor(bl(1)) + 1;
b2 = ceil(bl(2));
b3 = floor(bl(3));
b4 = min(fn2, ceil(bl(4))) - 1;
pf = log(1 + (b1:b4) / n / f0) / lr;
fp = floor(pf);
pm = pf - fp;
r = [fp(b2:b4) 1+fp(1:b3)];
c = [b2:b4 1:b3] + 1;
v = 2 * [1-pm(b2:b4) pm(1:b3)];
m = sparse(r, c, v, p, 1 + fn2);              % each row holds one triangular mel filter
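A possible usage sketch for melfb (not part of the project listings): plotting the 20 triangular mel-spaced filter responses for a 256-point FFT; the sampling rate of 8000 Hz is assumed to match the recordings above.

fs = 8000;
m  = melfb(20, 256, fs);
plot(linspace(0, fs/2, 129), full(m)');       % one curve per mel filter
title('Mel-spaced filter bank');  xlabel('Frequency (Hz)');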
vqlbg.m (subroutine called from the main function)
function r = vqlbg(d, k)
e = .0003;                                    % splitting and convergence parameter
r = mean(d, 2);                               % start from a single centroid (the mean)
dpr = 10000;
for i = 1:log2(k)
    r = [r*(1+e), r*(1-e)];                   % split each centroid in two
    while (1 == 1)
        z = disteu(d, r);
        [m, ind] = min(z, [], 2);             % assign each vector to its nearest centroid
        t = 0;
        for j = 1:2^i
            r(:, j) = mean(d(:, find(ind == j)), 2);    % update each centroid
            x = disteu(d(:, find(ind == j)), r(:, j));
            for q = 1:length(x)
                t = t + x(q);                 % accumulate the total distortion
            end
        end
        if (((dpr - t) / t) < e)              % stop when the distortion change is small
            break;
        else
            dpr = t;
        end
    end
end

test.m (main function for testing)
function test(testdir, n, code)
for k = 1:n                                   % read the test sound file of each speaker
    file = sprintf('%ss%d.wav', testdir, k);
    [s, fs] = wavread(file);
    v = mfcc(s, fs);                          % compute MFCCs
    distmin = inf;
    k1 = 0;
    for l = 1:length(code)                    % for each trained codebook, compute the distortion
        d = disteu(v, code{l});
        dist = sum(min(d, [], 2)) / size(d, 1);    % average nearest-codeword distance
        if dist < distmin
            distmin = dist;
            k1 = l;
        end
    end
    msg = sprintf('Speaker %d matches with speaker %d', k, k1);
    disp(msg);
end

6.5 CHALLENGES FACED

It was difficult to deal with a text-independent system.
Noise was a factor.
Different people have different accents, so the system needed to be trained intelligently.
The recorded voice needed to be cut to the precise size of the matched sample.

6.6 CONCLUSION
The objective of this project was to build a speaker recognition system, and this was achieved
as shown by the simulation: eight different voice clips were successfully matched to their
respective speakers. This was made possible through feature extraction with the Mel Frequency
Cepstral Coefficients (MFCC). Each speaker was modelled using Vector Quantization, whereby a VQ
codebook was generated by clustering the training feature vectors of that speaker. In
recognising a speaker, the minimum Euclidean distance to the stored codebooks was used to match
the unknown speaker.
The vector quantization method was faster than the other methods considered and is also efficient.


6.7 RECOMMENDATIONS
The project focused on building a text-dependent, closed-set voice recognition system, since
only one word was uttered. As explained before, a closed set is one in which there is a limited
number of speakers. This could be improved in future by using statistical models such as
Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM), together with learning models
such as Neural Networks and other associated aspects of artificial intelligence.
This would also make the project less prone to noise and would cater for different accents and
moods. The following areas can also be looked into:
The VQ code takes a long time to compute.
The size of the training data.
The detection method used.

REFERENCES


1. Humaid ALSheiili, Gourab Sen Gupta, Subhas Mukhopadhyay, "Voice Recognition Based Wireless
Home Automation System", 4th International Conference on Mechatronics (ICOM), 17-19 May 2011,
Kuala Lumpur, Malaysia, pp. 1-6, 2011.
2. S. Furui, Digital Speech Processing, Synthesis and Recognition, New York: Marcel Dekker, 2001.
3. Voice Recorder Reference Design (AN 278), Silicon Laboratories, 2006.
4. Forensic Articles website [cited 15 Nov 2015]. Available: www.owlineinvestigations.com/article/html
5. http://www.explainthatstuf.com/voice recognition.html [accessed 01 Nov 2015].
6. Saddik, H.; Rahmaan, A.; Sayidi, M., "Text independent speaker recognition using the Mel
frequency cepstral coefficients and a neural network classifier", IEEE, 2004, pp. 63-634.
7. A. Dempster, N. Laird and D. Rubin, "Maximum likelihood from incomplete data via the EM
algorithm", J. Roy. Stat. Soc., Vol. 39, pp. 1-38, 1977.
8. W. Ching and M. K. Ng, Markov Chain Models, Algorithms and Applications, New York: Springer
Science+Business Media, Inc., 2006.
9. Zegers, P., Speech Recognition Using Neural Networks, MS Thesis, University of Arizona,
Department of Electrical Engineering in the Graduate College, Arizona, USA, 1998.
10. RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m.htm.
11. DSP Mini-Project: Speaker Recognition.htm.
12. www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
13. http://www.ifp.uiuc.edu, speaker recognition.
14. Speaker Recognition Proposal, course, cct.carnell.edu.
