
SPEECH EMOTION RECOGNITION BASED ON DEEP

AUTO ENCODER
Dissertation submitted in fulfillment of the requirements for the Degree of

MASTER OF TECHNOLOGY

By

SAILESH TOMAR

Department of Electronics and Communication

JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY

(Declared Deemed to be University U/S 3 of UGC Act)

A-10, SECTOR 62, NOIDA, INDIA

MAY 2017

CERTIFICATE

This is to certify that the work reported in the M.Tech Dissertation entitled SPEECH
EMOTION RECOGNITION BASED ON DEEP AUTO ENCODER, submitted by
SAILESH TOMAR at Jaypee Institute of Information Technology, Noida, India, is a
bonafide record of his original work carried out under my supervision. This work has not been
submitted elsewhere for any other degree or diploma.

Dr. Rajesh Kumar Dubey


Jaypee Institute of Information Technology
Noida
Date:

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my parents for their love, guidance and continuous
support. I would like to thank Dr. Rajesh Kumar Dubey for introducing me to the interesting
subject of speech processing and for being my mentor during this project. He not only guided me
during this project, but also taught me immensely about approaching research problems in
general. I have learned a lot from his impeccable work ethic and friendly behavior. Lastly, I
would like to thank my sister, who cheers me up with her joyous energy whenever I feel tense
and exhausted.

Sailesh Tomar
Electronics and Communication Engineering Department
Jaypee Institute of Information Technology, Noida

TABLE OF CONTENTS
ABSTRACT (vi)

LIST OF ACRONYMS AND ABBREVIATIONS (vii)

LIST OF SYMBOLS (9)

LIST OF FIGURES (9)

LIST OF TABLES (10)

CHAPTER 1: SPEECH RECOGNITION BASICS

1.1 OVERVIEW...........................................................................................................................11
1.2 REASON FOR SPEECH AND PROCESS.........................................................................11
1.3 TYPES OF SPEECH RECOGNITION...............................................................................11
1.4 GRAMMARS AND VOCABULARIES..............................................................................11
1.5 GENERAL TELEPHONY TERMS....................................................................................12
1.6 TYPES OF SPEECH APPLICATIONS AND GRAMMAR TERMS..............................12
1.7 OTHER SPEECH TERMS..................................................................................................13
1.8 TRADITIONAL FEATURES..............................................................................................13
1.9 SPEECH RECOGNITION...................................................................................................14

CHAPTER 2: ARTIFICIAL INTELLIGENCE

2.1 APPROACHES......................................................................................................................22
2.2 TOOLS....................................................................................................................................26
2.3 LANGUAGES AND PROBLEMS ADDRESSED THROUGH AI...................................28
2.4 ADVANTAGES......................................................................................................................28
2.5 DISADVANTAGES...............................................................................................................28
2.6 APPLICATIONS...................................................................................................................28

CHAPTER 3: SPEECH EMOTION RECOGNITION

3.1 INTRODUCTION..................................................................................................................29
3.2 SPEECH EMOTION FEATURES......................................................................................31
3.3 DEEP AUTO-ENCODER.....................................................................................................33
3.3.1 THEORY OF DEEP AUTO-ENCODER.........................................................................33
3.3.2 THE TRAINING OF DEEP AUTO-ENCODER.............................................................36

CHAPTER 4: EXPERIMENT AND CONCLUSION

4.1 DESIGN OF EXPERIMENT................................................................................................40
4.2 EXPERIMENTAL RESULTS AND ANALYSIS...............................................................43
4.3 CONCLUSION......................................................................................................................44

ABSTRACT

Good features contain important information that plays a critical role in speech emotion
recognition. In this project, a relatively new deep learning technique is used, and the phonetic
features are extracted with a deep auto-encoder. The deep auto-encoder consists of five hidden
layers. The input data is obtained by dividing the audio into short frames; each frame of the
emotional speech signal is decomposed with a wavelet transform, and the Fourier transform of the
result is then taken. The features learned by the deep auto-encoder are compared with traditional
features such as MFCC and LPCC. The recognition accuracy obtained with the deep auto-encoder
is nearly 86%, which is higher than that of the traditional techniques.

LIST OF ACRONYMS AND ABBREVIATIONS

HCI      Human Computer Interaction
SER      Speech Emotion Recognition
MFCC     Mel Frequency Cepstral Coefficient
LPC      Linear Prediction Coefficient (Linear Predictive Coding)
LSP      Line Spectrum Pair
PLP      Perceptual Linear Prediction cepstral coefficient
OSALPC   One-Sided Autocorrelation Linear Prediction Coefficient
FT       Fourier Transform
SMC      Short coherence
LPCC     Linear Prediction Cepstral Coefficient
GMM      Gaussian Mixture Model
OSALPCC  One-Sided Autocorrelation Linear Prediction Cepstral Coefficient
ZCPA     Zero-Crossings with Peak-Amplitudes
TEO      Teager Energy Operator
EMD      Empirical Mode Decomposition
DBN      Deep Belief Network
DL       Deep Learning
AI       Artificial Intelligence
DAE      Deep Auto-Encoder
AE       Auto-Encoder
SNR      Signal-to-Noise Ratio
CASIA    Chinese Academy of Sciences Institute of Automation (speech emotion database)
SVM      Support Vector Machine
BP       Back Propagation algorithm
DTMF     Dual Tone Multi Frequency
IVR      Interactive Voice Response system
ASR      Automatic Speech Recognition
SRE      Speech Recognition Engine
GUI      Graphical User Interface
TTS      Text To Speech
VUI      Voice User Interface
MRCP     Media Resource Control Protocol
VXML     Voice Extensible Markup Language
ABNF     Augmented Backus-Naur Form
GrXML    Grammar XML
SISR     Semantic Interpretation for Speech Recognition (standard language for adding logic)

LIST OF FIGURES

Figure Number    Caption                                                      Page Number

1.8              Mel frequency cepstral coefficient (MFCC)                    13

3.2              Block diagram of speech emotion recognition                  30

3.5              Deep auto-encoder model                                      34

3.7              Pre-training diagram of deep auto-encoder                    37

LIST OF TABLES

4.1              All kinds of emotion recognition results based on different features     40

CHAPTER 1
SPEECH RECOGNITION BASICS
1.1 Overview
Speech recognition is an input method with clear limitations: it recognizes what you say, not
what you mean; it is not artificial intelligence; and it is only as good as the application built
around it. Speech recognition means recognizing specific words, whereas voice recognition
means recognizing a specific voice, i.e., what was said versus who said it.

1.2 Reason for speech and process


1. Reason for speech

Speech input allows more natural interactions and greater convenience, supports open-ended
questions, and handles tasks that are difficult with DTMF, e.g., city/state entry and call routing.

2. Process

The speech engine loads a list of words to be recognized, loads the audio, analyzes the audio
for distinct sounds and characteristics, compares the sounds to internal acoustic models using
the word list, and returns probable matches.

1.3 Types of speech recognition


1.3.1. Speaker dependent

Works for one person's voice and requires a lengthy training period; the recognition engine
has to be trained for that voice. Used mainly for dictation software, with large vocabularies.

1.3.2. Speaker independent

Works for any number of speakers with no training required. It is the most common type for
IVRs and uses limited vocabularies, e.g., telephone applications.

1.4 Grammars and vocabularies


1. Grammars

Lists of words to be recognized, kept as individual files; they are structured in some way and
may contain extra data.

2. Vocabularies

The list of words in all active grammars. A vocabulary is influenced by the prompts and vice
versa, and smaller vocabularies tend to be more accurate.

1.5 General Telephony Terms
1.5.1. ASR - Automatic Speech Recognition

Also called a speech recognition engine (SRE); it turns speech into text. It does not mean the
computer understands the meaning of the words.

1.5.2. Voice Recognition

Recognizes a specific speaker and is used mainly for biometrics; it is often confused with
speech recognition. Speech recognition recognizes what was said; voice recognition
recognizes who said it.

1.5.3. TTS - Text-to-Speech

Turns written text into spoken words; the opposite of ASR.

1.5.4. VUI - Voice User Interface

An IVR interface for speech recognition; think of it as the speech counterpart of a GUI
(graphical user interface).

1.5.5. MRCP - Media Resource Control Protocol

A standard method of controlling telephony platforms, speech engines, etc. It comes in two
incompatible versions.

1.5.6. VXML - Voice Extensible Markup Language

A standard language for writing IVRs. It supports playing prompts, using TTS, capturing
input via DTMF or ASR, and performing logic; essentially it controls a platform and other
media resources.

1.6 Types of speech applications and Grammar Terms


Natural language: Allows callers to speak freely.

Directed dialogue: Callers are guided and reply with short commands

Grammar: List of words and phrases to be recognized

Vocabulary: The sum of all active grammars.

ABNF - Augmented Backus-Naur Form: Human readable/editable format.

GrXML Grammar XML: Mainly for machine generated grammars.

SISR: Standard language for adding logic.

1.7 Other Speech Terms:
Phoneme: The smallest unit of meaningful sound.

Utterance: What a speaker says at a given time.

Decode: The act of recognizing speech.

Confidence Score: A numeric indication of how likely it is that the engine's result is what was
actually said.

1.8 Traditional features


1.8.1 Linear Predictive Coding (LPC)

LPC separates the sound formed in the larynx from the formant structure of the vocal tract;
the formant is the parameter associated with it. Inverse filtering performs the removal of the
formants, and the residue is the remaining sound. Perceptual Linear Prediction (PLP) is similar
to LPC, but more transformations are used, based on psychophysics. Its three feature
components are: critical-band spectral resolution, the equal-loudness curve, and the
intensity-loudness power law.

1.8.2 MFCC

Human perception of speech frequency is roughly linear up to about 1000 Hz and logarithmic
above that. The cepstrum is the Fourier transform of the logarithm of the Fourier transform of
the speech signal. Cepstral coefficients are more efficient than PLP and LPC features because
of their faster extraction technique. All of the above feature extraction methods have been
extensively analyzed; MFCC gives better extraction compared to PLP and LPC, and is a faster
method compared to vector quantization and the FFT alone. (A small sketch of the MFCC
computation follows the figure below.)

Fig.1. MFCC STEPS IN SPEECH ANALYSIS
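
As a rough illustration of the cepstral idea described above (Fourier transform, logarithm of the mel-scaled magnitudes, then a decorrelating cosine transform), a minimal Python/NumPy sketch is given below. The frame length, number of mel filters and number of coefficients are illustrative assumptions and are not values specified in this project.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping (roughly linear below ~1 kHz, logarithmic above).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    # 1) magnitude spectrum, 2) mel filterbank energies, 3) log, 4) DCT (decorrelation).
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    energies = mel_filterbank(n_filters, len(frame), sr) @ spec
    log_e = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_filters))
    return dct @ log_e

# Example: a 25 ms frame of a synthetic 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
print(mfcc_frame(np.sin(2 * np.pi * 440 * t), sr)[:5])
```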

1.9 Speech recognition
Speech recognition (SR) is an inter-disciplinary sub-field of computational linguistics that
draws on research in linguistics, electrical engineering and computer science to develop
methodologies and technologies that enable the recognition and translation of spoken
language into text by computers and computerized tools, such as those categorized as smart
technologies and robotics. It is also known as "automatic speech recognition" (ASR),
"computer speech recognition", or just "speech to text" (STT).

Some SR systems use "training" (also called "enrollment"), where an individual speaker
reads text or isolated vocabulary into the system. The system examines the person's
particular voice and uses it to fine-tune the recognition of that person's speech, resulting in
increased accuracy. Systems that do not use training are called "speaker independent"
systems; systems that use training are called "speaker dependent".

1.9.1 History

As early as 1932, Bell Labs researchers like Harvey Fletcher were investigating the science
of speech perception. In 1952 three Bell Labs researchers built a system for single-speaker
digit recognition. Their system worked by locating the formants in the power spectrum of
each utterance. The 1950s era technology was limited to single-speaker systems with
vocabularies of around ten words.

The 1990s saw the first introduction of commercially successful speech recognition
technologies. By this point, the vocabulary of the typical commercial speech recognition
system was larger than the average human vocabulary. In 2000, Lernout & Hauspie acquired
Dragon Systems and was an industry leader until an accounting scandal brought an end to the
company in 2001. The L&H speech technology was bought by ScanSoft which
became Nuance in 2005. Apple originally licensed software from Nuance to provide speech
recognition capability to its digital assistant Siri.

1.9.2 Models, methods, and algorithms
Both acoustic modeling and language modeling are important parts of modern statistically-
based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in
many systems. Language modeling is also used in many other natural language processing
applications such as document classification or statistical machine translation.

Hidden Markov models


Modern general-purpose speech recognition systems are based on Hidden Markov Models.
These are statistical models that output a sequence of symbols or quantities. HMMs are used
in speech recognition because a speech signal can be viewed as a piecewise stationary signal
or a short-time stationary signal. In a short time-scale (e.g., 10 milliseconds), speech can be
approximated as a stationary process. Speech can be thought of as a Markov model for many
stochastic purposes.

Another reason why HMMs are popular is because they can be trained automatically and are
simple and computationally feasible to use. The observation vectors would consist
of cepstral coefficients, which are obtained by taking a Fourier transform of a short time
window of speech, decorrelating the spectrum using a cosine transform, and then taking the
first (most significant) coefficients. The hidden Markov model will tend to have in each state
a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a
likelihood for each observed vector. Each word, or (for more general speech recognition
systems), each phoneme, will have a different output distribution; a hidden Markov model for
a sequence of words or phonemes is made by concatenating the individual trained hidden
Markov models for the separate words and phonemes.

Decoding of the speech (the term for what happens when the system is presented with a new
utterance and must compute the most likely source sentence) would probably use the Viterbi
algorithm to find the best path, and here there is a choice between dynamically creating a
combination hidden Markov model, which includes both the acoustic and language model
information, and combining it statically beforehand (the finite state transducer, or FST,
approach).
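
The Viterbi decoding step mentioned above can be illustrated with a short sketch. The toy transition and emission probabilities below are invented purely for illustration and do not come from any system described in this dissertation.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state path for an observation sequence.
    log_pi: (S,) initial log-probabilities; log_A: (S, S) transition log-probs;
    log_B: (S, O) emission log-probs; obs: list of observation indices."""
    S, T = len(log_pi), len(obs)
    score = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_A            # rows: previous state, cols: next state
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_B[:, obs[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):                       # backtrack the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 hidden states, 3 observation symbols (all numbers are illustrative).
pi = np.log([0.6, 0.4])
A = np.log([[0.7, 0.3], [0.4, 0.6]])
B = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 1, 2, 2]))
```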

Dynamic time warping (DTW)-based speech recognition
Dynamic time warping is an approach that was historically used for speech recognition but
has now largely been displaced by the more successful HMM-based approach.

Dynamic time warping is an algorithm for measuring similarity between two sequences that
may vary in time or speed. For instance, similarities in walking patterns would be detected,
even if in one video the person was walking slowly and in another he or she was walking
more quickly, or even if there were accelerations and decelerations during the course of one
observation. DTW has been applied to video, audio, and graphics; indeed, any data that can
be turned into a linear representation can be analyzed with DTW.
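
A minimal sketch of the DTW distance between two sequences is given below; the two toy sequences are arbitrary, with the slower one being a stretched version of the faster one.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW: cumulative cost over insertion, deletion and match moves."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The slower sequence is a time-stretched version of the faster one;
# DTW still finds a small distance despite the different lengths.
fast = [0.0, 1.0, 2.0, 1.0, 0.0]
slow = [0.0, 0.5, 1.0, 1.5, 2.0, 1.5, 1.0, 0.5, 0.0]
print(dtw_distance(fast, slow))
```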

Neural networks
Neural networks emerged as an attractive acoustic modeling approach in ASR in the late
1980s. Since then, neural networks have been used in many aspects of speech recognition
such as phoneme classification, isolated word recognition, and speaker adaptation.

In contrast to HMMs, neural networks make no assumptions about feature statistical
properties and have several qualities that make them attractive recognition models for speech
recognition. When used to estimate the probabilities of a speech feature segment, neural
networks allow discriminative training in a natural and efficient manner. However, in spite of their
effectiveness in classifying short-time units such as individual phones and isolated
words, neural networks are rarely successful for continuous recognition tasks, largely because
of their lack of ability to model temporal dependencies.

Deep Feedforward and Recurrent Neural Networks


A deep feedforward neural network (DNN) is an artificial neural network with multiple
hidden layers of units between the input and output layers. Similar to shallow neural
networks, DNNs can model complex non-linear relationships. DNN architectures generate
compositional models, where extra layers enable composition of features from lower layers,
giving a huge learning capacity and thus the potential of modeling complex patterns of
speech data.

LSTM also improved large-vocabulary speech recognition and text-to-speech synthesis,
including for Google Android, as well as photo-real talking heads. In 2015, Google's speech
recognition reportedly experienced a dramatic performance jump after adopting CTC-trained
LSTM models.

Since the initial successful debut of DNNs for speech recognition around 2009-2011 and of
LSTM around 2007, there has been huge new progress. This progress (as well as future
directions) has been summarized into the following eight major areas:

1. Scaling up/out and speedup DNN training and decoding;


2. Sequence discriminative training of DNNs;
3. Feature processing by deep models with solid understanding of the underlying
mechanisms;
4. Adaptation of DNNs and of related deep models;
5. Multi-task and transfer learning by DNNs and related deep models;
6. Convolutional neural networks and how to design them to best exploit domain
knowledge of speech;
7. Recurrent neural networks and their rich LSTM variants;
8. Other types of deep models, including tensor-based models and integrated deep
generative/discriminative models.

1.9.3 Application
In-car systems
Typically a manual control input, for example by means of a finger control on the steering-
wheel, enables the speech recognition system and this is signaled to the driver by an audio
prompt. Following the audio prompt, the system has a "listening window" during which it
may accept a speech input for recognition.

Simple voice commands may be used to initiate phone calls, select radio stations or play
music from a compatible Smartphone, MP3 player or music-loaded flash drive. Voice
recognition capabilities vary between car make and model. Some of the most recent car
models offer natural-language speech recognition in place of a fixed set of commands,
allowing the driver to use full sentences and common phrases. With such systems there is,
therefore, no need for the user to memorize a set of fixed command words.

Health care
In the health care sector, speech recognition can be implemented in front-end or back-end of
the medical documentation process. Front-end speech recognition is where the provider
dictates into a speech-recognition engine, the recognized words are displayed as they are
spoken, and the dictator is responsible for editing and signing off on the document. Back-end
or deferred speech recognition is where the provider dictates into a digital dictation system,
the voice is routed through a speech-recognition machine and the recognized draft document
is routed along with the original voice file to the editor, where the draft is edited and report
finalized. Deferred speech recognition is widely used in the industry currently.

As an alternative to this navigation by hand, cascaded use of speech recognition and


information extraction has been studied as a way to fill out a handover form for clinical
proofing and sign-off. The results are encouraging, and the paper also opens data, together
with the related performance benchmarks and some processing software, to the research and
development community for studying clinical documentation and language-processing.

Military
High-performance fighter aircraft
Substantial efforts have been devoted in the last decade to the test and evaluation of speech
recognition in fighter aircraft. Of particular note have been the US program in speech
recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16
VISTA), the program in France for Mirage aircraft, and other programs in the UK dealing
with a variety of aircraft platforms. In these programs, speech recognizers have been operated
successfully in fighter aircraft, with applications including: setting radio frequencies,
commanding an autopilot system, setting steer-point coordinates and weapons release
parameters, and controlling flight display.

Helicopters
As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot
effectiveness. Encouraging results are reported for the AVRADA tests, although these
represent only a feasibility demonstration in a test environment. Much remains to be done
both in speech recognition and in overall speech technology in order to consistently achieve
performance improvements in operational settings.

Training air traffic controllers
Training for air traffic controllers (ATC) represents an excellent application for speech
recognition systems. Many ATC training systems currently require a person to act as a
"pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the
dialog that the controller would have to conduct with pilots in a real ATC situation. Speech
recognition and synthesis techniques offer the potential to eliminate the need for a person to
act as pseudo-pilot, thus reducing training and support personnel. In theory, air controller
tasks are also characterized by highly structured speech as the primary output of the
controller, hence reducing the difficulty of the speech recognition task should be possible. In
practice, this is rarely the case. The FAA document 7110.65 details the phrases that should be
used by air traffic controllers. While this document gives fewer than 150 examples of such
phrases, the number of phrases supported by one simulation vendor's speech recognition
systems is in excess of 500,000.

Telephony and other domains


ASR is now commonplace in the field of telephony and is becoming more widespread in the
field of computer gaming and simulation. However, despite the high level of integration with
word processing in general personal computing, ASR in the field of document production has
not seen the expected increase in use.

Usage in education and daily life


For language learning, speech recognition can be useful for learning a second language. It can
teach proper pronunciation, in addition to helping a person develop fluency with their
speaking skills.

Students who are blind (see Blindness and education) or have very low vision can benefit
from using the technology to convey words and then hear the computer recite them, as well
as use a computer by commanding with their voice, instead of having to look at the screen
and keyboard.

People with disabilities
People with disabilities can benefit from speech recognition programs. For individuals that
are Deaf or Hard of Hearing, speech recognition software is used to automatically generate a
closed-captioning of conversations such as discussions in conference rooms, classroom
lectures, and/or religious services.

This type of technology can also help those with dyslexia, but its usefulness for other
disabilities is still in question, and its effectiveness remains the main obstacle. Although a
child may be able to say a word, depending on how clearly it is said the technology may
interpret it as another word and input the wrong one, giving the child more work to fix and
requiring more time to correct the wrong word.

1.9.4 Performance
The performance of speech recognition systems is usually evaluated in terms of accuracy and
speed. Accuracy is usually rated with the word error rate (WER); a small sketch of the WER
computation is given after the list below. Vocalizations vary in terms of accent, pronunciation,
articulation, roughness, nasality, pitch, volume, and speed, and speech is further distorted by
background noise, echoes, and electrical characteristics. The accuracy of speech recognition
varies with the following:

Vocabulary size and confusability


Speaker dependence vs. independence
Isolated, discontinuous, or continuous speech
Task and language constraints
Read vs. spontaneous speech
Adverse conditions
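
As referenced above, the word error rate can be computed from a word-level edit distance. The sketch below is a minimal illustration; the reference and hypothesis sentences are invented examples.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words,
    computed with a standard Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deletion ("the") and one insertion ("please") against 4 reference words -> WER = 0.5
print(word_error_rate("set the radio frequency", "set radio frequency please"))
```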

1.9.5 Accuracy
As mentioned earlier in this chapter, the accuracy of speech recognition depends on the
following stages of the pipeline:

A neural network classifies features into phonetic-based categories. Given the basic sound
blocks that a machine has digitized, one has a set of numbers that describe a wave, and waves
describe words. Each frame holds a unit block of sound, which is broken into basic sound
waves and represented by numbers that, after a Fourier transform, can be statistically
evaluated to decide which class of sounds it belongs to. The nodes of such a network represent
features of the sound: a feature of the wave is passed from the first layer of nodes to the
second layer of nodes based on statistical analysis, and this analysis depends on the
programmer's instructions. At this point, the second layer of nodes represents higher-level
features of the sound input, which are again statistically evaluated to see which class they
belong to. The last level of nodes consists of output nodes that tell us with high probability
what the original sound really was.

A search then matches the neural-network output scores against the vocabulary to determine
the word that was most likely uttered.

CHAPTER 2
ARTIFICIAL INTELLIGENCE
2.1 Approaches

Artificial intelligence (AI) is the intelligence of machines and the branch of computer science
that aims to create it. AI textbooks define the field as the study and design of intelligent
agents, where an intelligent agent is a system that perceives its environment and takes actions
that maximize its chances of success. The main approaches discussed below are: cybernetics
and brain simulation; symbolic AI; cognitive simulation; logic-based AI; anti-logic or
"scruffy" AI; knowledge-based AI; sub-symbolic AI; bottom-up, embodied, situated,
behavior-based or nouvelle AI; computational intelligence and soft computing; statistical AI;
and integrating the approaches through the intelligent agent paradigm, agent architectures and
cognitive architectures.

2.1.1. Cybernetics and Brain simulation

In the 1940s and 1950s, a number of researchers explored the connection between neurology,
information theory, and cybernetics. Some of them built machines that used electronic
networks to exhibit rudimentary intelligence, such as W. Grey Walter's turtles and the Johns
Hopkins Beast. Many of these researchers gathered for meetings of the Teleological Society at
Princeton University and the Ratio Club in England. By 1960, this approach was largely
abandoned, although elements of it would be revived in the 1980s.

2.1.2. Symbolic

When access to digital computers became possible in the middle 1950s, AI research began to
explore the possibility that human intelligence could be reduced to symbol manipulation. The
research was centered in three institutions: Carnegie Mellon University, Stanford and MIT,
and each one developed its own style of research. John Haugeland named these approaches to
AI "good old fashioned AI" or "GOFAI". During the 1960s, symbolic approaches had
achieved great success at simulating high-level thinking in small demonstration programs.
Approaches based on cybernetics or neural networks were abandoned or pushed into the
background. Researchers in the 1960s and the 1970s were convinced that symbolic
approaches would eventually succeed in creating a machine with artificial general
intelligence and considered this the goal of their field.

2.1.3. Cognitive Simulation

Economist Herbert Simon and Allen Newell studied human problem-solving skills and
attempted to formalize them, and their work laid the foundations of the field of artificial
intelligence, as well as cognitive science, operations research and management science. Their
research team used the results of psychological experiments to develop programs that
simulated the techniques that people used to solve problems. This tradition, centered at
Carnegie Mellon University, would eventually culminate in the development of the Soar
architecture in the middle 1980s.

2.1.4. Logic-based

Unlike Newell and Simon, John McCarthy felt that machines did not need to simulate human
thought, but should instead try to find the essence of abstract reasoning and problem solving,
regardless of whether people used the same algorithms. His laboratory at Stanford (SAIL)
focused on using formal logic to solve a wide variety of problems, including knowledge
representation, planning and learning. Logic was also the focus of the work at the University
of Edinburgh and elsewhere in Europe, which led to the development of the programming
language Prolog and the science of logic programming.

2.1.5. Anti-logic or scruffy

Researchers at MIT (such as Marvin Minsky and Seymour Papert) found that solving difficult
problems in vision and natural language processing required ad hoc solutions; they argued
that there was no simple and general principle (like logic) that would capture all the aspects of
intelligent behavior. Roger Schank described their anti-logic approaches as "scruffy" (as
opposed to the "neat" paradigms at CMU and Stanford). Common sense knowledge bases
(such as Doug Lenat's Cyc) are an example of "scruffy" AI, since they must be built by hand,
one complicated concept at a time.

2.1.6. Knowledge- based

When computers with large memories became available around 1970, researchers from all
three traditions began to build knowledge into AI applications. This "knowledge revolution"
led to the development and deployment of expert systems (introduced by Edward
Feigenbaum), the first truly successful form of AI software. The knowledge revolution was
also driven by the realization that enormous amounts of knowledge would be required by
many simple AI applications.

2.1.7. Sub- symbolic

By the 1980s progress in symbolic AI seemed to stall and many believed that symbolic
systems would never be able to imitate all the processes of human cognition, especially
perception, robotics, learning and pattern recognition. A number of researchers began to look
into sub-symbolic approaches to specific AI problems.

2.1.8. Bottom-up, Embodied, Situated, Behavior-based or Nouvelle AI

Researchers from the related field of robotics, such as Rodney Brooks, rejected symbolic AI
and focused on the basic engineering problems that would allow robots to move and survive.
Their work revived the non-symbolic viewpoint of the early cybernetics researchers of the
1950s and reintroduced the use of control theory in AI. This coincided with the development
of the embodied mind thesis in the related field of cognitive science: the idea that aspects of
the body (such as movement, perception and visualization) are required for higher
intelligence.

2.1.9. Computational Intelligence and Soft Computing

Interest in neural networks and connectionism was revived by David Rumelhart and others
in the middle 1980s. Neural networks are an example of soft computing: they are solutions
to problems which cannot be solved with complete logical certainty, and where an
approximate solution is often enough. Other soft computing approaches to AI include fuzzy
systems, evolutionary computation and many statistical tools. The application of soft
computing to AI is studied collectively by the emerging discipline of computational
intelligence.

2.1.10. Statistical

In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific
subproblems. These tools are truly scientific, in the sense that their results are both
measurable and verifiable, and they have been responsible for many of AI's recent successes.
The shared mathematical language has also permitted a high level of collaboration with more
established fields (like mathematics, economics or operations research). Stuart Russell and
Peter Norvig describe this movement as nothing less than a "revolution" and "the victory of
the neats".

2.1.11. Integrating the Approaches: Intelligent Agent paradigm

An intelligent agent is a system that perceives its environment and takes actions which
maximize its chances of success. The simplest intelligent agents are programs that solve
specific problems. More complicated agents include human beings and organizations of
human beings (such as firms). The paradigm gives researchers license to study isolated
problems and find solutions that are both verifiable and useful, without agreeing on one
single approach. An agent that solves a specific problem can use any approach that works:
some agents are symbolic and logical, some are sub-symbolic neural networks and others
may use new approaches.

2.1.12. Agent Architectures and Cognitive Architectures

Researchers have designed systems to build intelligent systems out of interacting intelligent
agents in a multi-agent system. A system with both symbolic and sub-symbolic components is
a hybrid intelligent system, and the study of such systems is artificial intelligence systems
integration. A hierarchical control system provides a bridge between sub-symbolic AI at its
lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time
constraints permit planning and world modeling. Rodney Brooks' subsumption architecture
was an early proposal for such a hierarchical system.

2.2 Tools

The main tools of AI include search and optimization, logic, probabilistic methods for
uncertain reasoning, classifiers and statistical learning methods, neural networks, deep neural
networks, control theory, and languages.

2.2.1. Search and optimization

Many problems in AI can be solved in theory by intelligently searching through many


possible solutions. Reasoning can be reduced to performing a search. For example, a logical
proof can be viewed as searching for a path that leads from premises to conclusion, where
each step is the application of an inference rule. Planning algorithms search through trees of
goals and subgoals, attempting to find a path to a target goal, a process called means-ends
analysis. Robotics algorithms for moving limbs and grasping objects use local searches in
configuration space. A very different kind of search came to prominence in the 1990s, based
on the mathematical theory of optimization. These algorithms can be visualized as blind hill
climbing: we begin the search at a random point on the landscape, and then, by jumps or
steps, we keep moving our guess uphill until we reach the top. Other optimization algorithms
include simulated annealing, beam search and random optimization.
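
The blind hill-climbing picture described above can be sketched in a few lines; the objective function, step size and iteration count below are arbitrary illustrative choices.

```python
import random

def hill_climb(f, x0, step=0.1, iterations=1000):
    """Start from a given point and keep accepting uphill moves."""
    x, best = x0, f(x0)
    for _ in range(iterations):
        candidate = x + random.uniform(-step, step)
        value = f(candidate)
        if value > best:              # move only if the new guess is higher
            x, best = candidate, value
    return x, best

# Illustrative landscape with a single peak at x = 2.
landscape = lambda x: -(x - 2.0) ** 2 + 4.0
print(hill_climb(landscape, x0=random.uniform(-10, 10)))
```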

2.2.2. Logic

Logic is used for knowledge representation and problem solving, but it can be applied to
other problems as well. Propositional or sentential logic is the logic of statements which can
be true or false. First-order logic allows the use of quantifiers and predicates, and can express
facts about objects, their properties, and their relations with each other. Fuzzy logic is a
version of first-order logic which allows the truth of a statement to be represented as a value
between 0 and 1, rather than simply True (1) or False (0). Default logics, non-monotonic
logics and circumscription are forms of logic designed to help with default reasoning and the
qualification problem. Several extensions of logic have been designed to handle specific
domains of knowledge, such as: description logics; situation calculus, event calculus and
fluent calculus (for representing events and time); causal calculus; belief calculus; and modal
logics.

2.2.3. Probabilistic Methods for Uncertain Reasoning

Many problems in AI (learning, robotics, perception, and reasoning) require the agent to
operate with incomplete or uncertain information. Bayesian networks are a very general tool
that can be used for a large number of problems: planning (using decision networks),
reasoning (using the Bayesian inference algorithm), perception (using dynamic Bayesian
networks), and learning (using the expectation-maximization algorithm). Probabilistic
algorithms can also be used for filtering, prediction, smoothing and finding explanations for
streams of data, helping perception systems to analyze processes that occur over time (e.g.
Kalman filters or hidden Markov models).

2.2.4. Classifiers and Statistical Learning Methods

The simplest AI applications can be divided into two types: controllers ("if shiny then pick
up") and classifiers ("if shiny then diamond"). Controllers do, however, also classify
conditions before inferring actions, so classification forms a central part of many AI
systems. Classifiers are functions that use pattern matching to determine the closest match. In
supervised learning, each pattern belongs to a certain predefined class. A class can be seen as
a decision that has to be made. A classifier can be trained in various ways; there are many
statistical approaches and machine learning approaches. The most widely used classifiers are
the neural network, kernel methods such as the support vector machine, the k-nearest
neighbor algorithm, the Gaussian mixture model, the naive Bayes classifier, and the decision
tree.
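
As an illustration of the classifier idea, and because the support vector machine is the classifier used later in this project, a small sketch using scikit-learn is shown below. The synthetic feature vectors and the use of scikit-learn are assumptions made only for illustration; they are not part of the dissertation's actual setup.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class "feature" data standing in for emotion feature vectors.
rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, scale=1.0, size=(50, 8))
class_b = rng.normal(loc=2.0, scale=1.0, size=(50, 8))
features = np.vstack([class_a, class_b])
labels = np.array([0] * 50 + [1] * 50)

# Train a support vector machine, one of the classifiers named above.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(features, labels)
# A new point drawn near class b should most likely be labeled 1.
print(clf.predict(rng.normal(loc=2.0, scale=1.0, size=(1, 8))))
```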

2.2.5. Neural networks (NN)


A NN is an interconnected group of nodes, akin to the vast network of neurons in the human
brain. The study of artificial neural networks began in the decade before the field of AI
research was founded, in the work of Walter Pitts and Warren McCulloch. Other important
early researchers were Frank Rosenblatt, who invented the perceptron, and Paul Werbos, who
developed the back propagation algorithm. The main categories of networks are acyclic or
feedforward neural networks (where the signal passes in only one direction) and recurrent
neural networks (which allow feedback). NNs can be applied to the problems of intelligent
control (for robotics) or learning, using such techniques as Hebbian learning and competitive
learning.

2.2.6. Deep Neural Networks (DNN)
A DNN is an ANN with multiple hidden layers of units between the input and output
layers. Similar to shallow ANNs, DNNs can model complex non-linear relationships. Over the
last few years, advances in both machine learning and computer hardware have led to more
efficient methods for training DNNs that contain many layers of non-linear hidden units and a
very large output layer.
2.3 Languages and problems addressed through AI
Several languages have been developed for AI research, including Lisp and Prolog. Problems
addressed through AI include motion and manipulation, reasoning, deduction, problem
solving, social intelligence, and natural intelligence.
2.4 Advantages
AI can carry out complex work that humans cannot do or struggle with, can take on stressful
tasks, can most likely complete tasks faster than a human, and produces fewer errors and
defects. It can function indefinitely and can help discover unexplored domains, e.g. outer
space.
2.5 Disadvantages
AI has the ability to replace human jobs, lacks the human touch, can be misused leading to
mass-scale destruction, and can malfunction and do the opposite of what it is programmed to
do.
2.6 Applications
Applications include gesture recognition, individual voice recognition, global voice
recognition, robot navigation, and facial expression recognition for the interpretation of
emotion and nonverbal cues.

CHAPTER 3
SPEECH EMOTION RECOGNITION

3.1 INTRODUCTION
With the advancement of technology related to human-computer interaction (HCI)
involving artificial intelligence and robotics, there has been considerable demand for
computers with a more friendly HCI so that the computer can perform humanized functions,
and one of the most important steps in this direction is to enable the computer to understand
human emotions. In human communication, the most important method of transferring
information is speech. Owing to the development of better voice sensors, speech emotion
recognition (SER) has become important in HCI. The objective of SER is to understand
human emotion from voice signals so that communication between human and computer
becomes more human friendly. In particular, the computer must be able to understand the
current emotional state of the communicator from voice signals, and thus analyzing and
studying the speech emotion interface becomes important in determining that state. The
process of SER involves the acquisition of emotional voice data, extraction of emotional
features, and recognition of emotion. The block diagram of the SER process is shown in
Fig. 2:

Fig.2. BLOCK DIAGRAM OF SPEECH EMOTION RECOGNITION

One can perform recognition of speech with a voice feature extraction algorithm. However,
the result of speech recognition is greatly affected by the voice of the communicator, the
speaking style of the communicator, the place where the communication occurs (such as
noise in the background), the timing of the voice signals between communicators, and so on.
The process is also made complex by the different pronunciations and auditory systems used
by humans. Thus, there is a need to determine the ideal features for completely describing the
voice signal. As a result of these factors, the extraction of voice/emotion features and SER
become more difficult. Some of the most important traditional voice features used at the
present time are the Perceptual Linear Prediction cepstral coefficient (PLP) and
Zero-crossings with Peak-amplitudes (ZCPA). These features have a significant effect on
speech recognition and have gradually enriched speech databases, which in turn has led to the
enrichment of emotional speech databases. However, for determining the laws governing the
relationship between emotional features and speech in a voice database, these traditional
features are insufficient and lack the requisite information. At the present time, deep learning
technology is extensively used to solve problems of determining characteristic features in
large databases. Deep learning technology has been successfully employed in speech and
image recognition, but is seldom used for speech emotion recognition. In this project, we
present a systematic approach for utilizing a deep learning technique in the extraction of
speech emotion features: the deep learning technique automatically analyzes the input data
and obtains the features required for classification, thereby making recognition and
classification much simpler and allowing them to be completed in a very short span of time.
Thus, it is fitting to use deep learning for speech emotion recognition.

3.2 SPEECH EMOTION FEATURES
Speech features are essential elements in the SER process. The chosen features must reflect
the emotion of the communicator and be devoid of other effects such as the voice of the
communicator, the method of delivering the speech, the content of the speech, and so on. The
phonetic features of speech are divided into three different categories: prosodic features,
spectral features, and other features.

Prosodic Features
Prosodic features contain the information related to emotions in speech, and have been used
in the SER process. The classical prosodic features consist of the following:
a. temporal features, such as duration;
b. pitch frequency;
c. energy;
d. formants, such as the first-order and second-order formants and the formant bandwidth;
e. glottal parameters;
f. content features such as phrases, idioms, words, etc.;
g. time structure.

Spectrum Features
Spectral features are generally described by short-time representations of the speech signal.
Some of the important spectral features are the following: the short-time Fourier transform,
the Perceptual Linear Prediction cepstral coefficient (PLP) and Zero-crossings with
Peak-amplitudes (ZCPA). Further, researchers have proposed new features, such as
performing a wavelet transformation on each frame of the speech signal and then obtaining
its Fourier transformation.

Other Features
In addition to the prosodic and spectral features, there are several other common
characteristics used in SER. These characteristics include:
a. speech features built on the Teager energy operator (TEO), used to determine the non-linear
features of the speech signal (a short sketch of this operator is given after the list);
b. speech features built on empirical mode decomposition (EMD), used to analyze non-linear
and non-stationary speech signals;
c. speech features built on the fractal dimension;
d. features built on deep learning techniques.
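
As referenced in item (a), the discrete Teager energy operator has a simple closed form, Psi[x(n)] = x(n)^2 - x(n-1) x(n+1); a minimal sketch is given below, applied to an arbitrary test tone.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure discrete tone A*sin(w*n), the operator equals A^2 * sin(w)^2, a constant.
n = np.arange(200)
tone = 0.5 * np.sin(0.2 * n)
print(teager_energy(tone)[:3])
```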

In complex acoustic environments, determining traditional features by imitating the human
auditory system becomes very difficult. The irrelevant information present in the speech
features also affects the accuracy of the speech recognition process.

3.3 DEEP AUTO-ENCODER
Hinton, in 2006, proposed deep belief networks. The term deep belief network, or simply
deep network, is derived from the artificial neural network. Since then the deep network has
been a major topic of discussion in the field of artificial intelligence. In deep learning theory,
each layer can be trained separately, one layer at a time, in an unsupervised manner. The
objective is to learn abstract representations from the given input information, and these
abstract representations are further used as input for supervised learning tasks such as
regression and classification.

3.3.1 Theory of Deep Auto-encoder


This project, which is based on deep learning techniques, proposes the application of the deep
auto-encoder (DAE) to the extraction of speech emotion features. The DAE is a special type
of deep neural network in which the dimensions of the input and output are the same. The
input and output work in an iterative manner, i.e., the output obtained from the original input
becomes the next input in the network, and so on. The DAE makes a reversible conversion
between the spatial distribution of the given data and the special features present in that
space, which can be interpreted as the decomposition and reconstruction of the given input
signal. Thus, the DAE not only helps in learning the encoding mechanism but also in
determining the abstract representation of the original data hidden in the subsequent layers,
i.e., feature extraction. When analyzing the original input, prior knowledge of the class
information of the training samples is not required; this essentially means that a large amount
of unlabeled data can be analyzed using unsupervised learning. An auto-encoder consists of
one or more hidden layers sandwiched between an input layer and an output layer. If there is
more than one hidden layer, the encoder is referred to as a deep auto-encoder. The DAE
extracts the feature representation of the original data through a number of hidden layers, and
the resulting representation is then decoded through the hidden layers and reconstructed at the
output layer. The network structure used in this project is shown in Fig. 3.

Fig.3. DEEP AUTO-ENCODER MODEL

In order to tune the parameters of the DAE, the minimized squared error between the input
and its reconstruction at the output layer is used as the objective function during training. The
encoder maps the raw input x to the hidden layer, and this mapping is given by

    h = f_θ(x) = s(Wx + b)                                                  (1)

where θ = {W, b} is the set of encoder parameters and s(·) is the non-linear activation
function.

The decoder maps the learned features of the hidden layer back to a reconstruction at the
output layer, and this reconstruction is expressed as

    y = g_θ'(h) = s(W'h + b')                                               (2)

where θ' = {W', b'} is the set of decoder parameters, s(·) is again the non-linear activation
function, and the tied weights satisfy the relation W' = W^T.

Further, the activation function between the hidden layers of the network may be a linear or a
non-linear function. In general, the non-linear activation function is the sigmoid function,
which is expressed as

    sigmoid(z) = 1 / (1 + e^(-z))                                           (3)

The objective during training is to determine the set of parameters θ = {W, b, b'} which
minimizes the reconstruction error. In general, the reconstruction error function is defined
either by the squared error function or by the non-linear cross-entropy loss function,
expressed by the relations given below:

    L(x, y) = ||x - y||^2                                                   (4)

    L(x, y) = -Σ_k [ x_k log(y_k) + (1 - x_k) log(1 - y_k) ]                (5)

The squared error function is used with a linear activation mode at the output layer, while the
cross-entropy loss function is used with the sigmoid activation mode.
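
A minimal NumPy sketch of a single tied-weight auto-encoder layer following Eqs. (1)-(5) is given below. The layer sizes, initialization scale and random input are illustrative assumptions, not the configuration used in this project.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))             # Eq. (3)

# One auto-encoder layer with tied weights W' = W^T.
n_in, n_hidden = 16, 8
W = rng.normal(scale=0.1, size=(n_hidden, n_in))
b = np.zeros(n_hidden)
b_prime = np.zeros(n_in)

def encode(x):
    return sigmoid(W @ x + b)                    # Eq. (1): h = s(Wx + b)

def decode(h):
    return sigmoid(W.T @ h + b_prime)            # Eq. (2): y = s(W'h + b'), W' = W^T

def squared_error(x, y):
    return np.sum((x - y) ** 2)                  # Eq. (4)

def cross_entropy(x, y):
    eps = 1e-10                                  # Eq. (5), for inputs scaled to [0, 1]
    return -np.sum(x * np.log(y + eps) + (1 - x) * np.log(1 - y + eps))

x = rng.uniform(size=n_in)
y = decode(encode(x))
print(squared_error(x, y), cross_entropy(x, y))
```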

3.3.2 The Training of Deep Auto-Encoder
Initially, a layer-by-layer pre-training algorithm is used to set the parameters of the encoding
part of the DAE; the decoding parameters are then initialized by mirroring (unrolling) the
encoder, and finally the parameters of the whole network are set during network training. It is
necessary to mention that the DAE, by itself, cannot classify data; it can only extract features
from the hidden layers. Thus, a classifier is attached at the end to perform the classification of
the data.
The objective of the layer-by-layer algorithm is to train a single layer of the network at a
particular point in time. Training of the first two layers of the network begins only after the
training of the first layer has been completed, and so on by analogy. Thus, one can say that
the initial parameters of the first L-1 layers are set before the L-th layer is finally added. The
training process for the deep network consists of the following steps (a sketch of this
procedure is given after the list):
a. The first layer is trained using an unsupervised method, and its output is obtained by
minimizing the reconstruction error with respect to the original input.
b. The output of the initial hidden layer is used as the input to the next layer of the network,
and the unlabeled data samples are used for training the subsequent layer within a certain
error range.
c. Step (b) is repeated until the training of the whole network is complete.
d. The output of the last hidden layer acts as the input to a supervised layer, and the
parameters of the trained layers initialize the parameters of the whole network.
e. Finally, all the layers of the entire network are fine-tuned with supervision, in accordance
with a supervised learning method.
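
As referenced above, the following is a hedged sketch of the unsupervised layer-by-layer pre-training of steps (a)-(c). The toy data, layer sizes, learning rate and per-sample gradient step are simplifying assumptions; the supervised steps (d) and (e) are sketched later, after Eqs. (7)-(13).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, lr=0.5, epochs=200):
    """Unsupervised training of one auto-encoder layer by plain gradient descent
    on the squared reconstruction error (steps a and b of the procedure)."""
    n_in = data.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        for x in data:
            h = sigmoid(W1 @ x + b1)
            y = sigmoid(W2 @ h + b2)
            dz2 = (y - x) * y * (1 - y)           # output-layer residual
            dz1 = (W2.T @ dz2) * h * (1 - h)      # hidden-layer residual
            W2 -= lr * np.outer(dz2, h); b2 -= lr * dz2
            W1 -= lr * np.outer(dz1, x); b1 -= lr * dz1
    return W1, b1

def pretrain_stack(data, layer_sizes):
    """Steps (a)-(c): train each layer on the codes produced by the previous one."""
    layers, codes = [], data
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(codes, n_hidden)
        layers.append((W, b))
        codes = sigmoid(codes @ W.T + b)          # output of this layer feeds the next
    return layers, codes

toy_data = rng.uniform(size=(20, 12))             # illustrative "super-frame" vectors
stack, features = pretrain_stack(toy_data, layer_sizes=[8, 5])
print(features.shape)                             # (20, 5) learned representation
```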

The block diagram for the pre-training of the DAE is shown in Fig. 4.

Fig.4. PRE-TRAINING DIAGRAM OF DEEP AUTO-ENCODER

The fine-tuning of the deep network is based on the back-propagation algorithm. The central
theme of the fine-tuning algorithm rests upon treating the entire deep network as one network
and then readjusting the pre-trained parameters by a supervised learning technique. After
several iterations in this manner, the parameters related to the weights and biases are
optimized. For the training set {(x^(1), y^(1)), ..., (x^(m), y^(m))} of the DAE, where the
targets satisfy the condition y^(i) = x^(i), the overall cost function for the training set is
expressed as

    J(W, b) = (1/m) Σ_{i=1..m} (1/2) ||h_{W,b}(x^(i)) - y^(i)||^2 + (λ/2) Σ_l Σ_i Σ_j (W_ji^(l))^2      (6)

where h_{W,b}(x^(i)) gives the activation value of the output for the input x^(i) and λ is the
weight attenuation (weight decay) coefficient. The first term is the squared error between the
input reconstructed by the network and the initial input; the second term is the regularization
term. The cost function is minimized by taking the partial derivatives of J(W, b) with respect
to the network weight and bias parameters. The steps taken in this process are as follows:

1) The forward propagation algorithm is used to obtain the activation of each node from the
second layer up to the output layer, and is given by

    a^(l+1) = f( W^(l) a^(l) + b^(l) )                                      (7)

where a^(l) is the vector of node values of layer l, b^(l) is the bias of layer l and W^(l) is the
connecting weight between layer l and layer l+1; z^(l) = W^(l-1) a^(l-1) + b^(l-1) denotes the
weighted input of layer l.

2) The residual error of the i-th neuron in the output layer nl is

    δ_i^(nl) = -( y_i - a_i^(nl) ) · f'( z_i^(nl) )                         (8)

3) The residual errors are then propagated back towards the first hidden layer. The residual of
the i-th node of layer l is

    δ_i^(l) = ( Σ_j W_ji^(l) δ_j^(l+1) ) · f'( z_i^(l) )                    (9)

where W_ji^(l) is the weight connecting unit i of layer l with unit j of layer l+1.

4) Determine the partial derivatives.

    ∂J/∂W_ij^(l) = a_j^(l) δ_i^(l+1) + λ W_ij^(l)                           (10)

    ∂J/∂b_i^(l) = δ_i^(l+1)                                                 (11)

5) Optimize the set of parameters, i.e., the network weights and node biases, by gradient
descent with learning rate α.

    W_ij^(l) = W_ij^(l) - α ∂J/∂W_ij^(l)                                    (12)

    b_i^(l) = b_i^(l) - α ∂J/∂b_i^(l)                                       (13)

The algorithm used for feature extraction is based on the DAE: it creates intermediate hidden
layers that are connected to the original input of the deep network via the non-linear mapping
of the multiple hidden layers, and the extracted features are reconstructed again at the output
layer.
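
A compact sketch of the forward pass, residuals and parameter updates of Eqs. (7)-(13) is given below for a small fully connected network trained to reconstruct its input. The network shape, learning rate and data are illustrative assumptions, and the weight-decay term of Eq. (6) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda z: 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
df = lambda a: a * (1.0 - a)                      # its derivative, written in terms of a = f(z)

sizes = [6, 4, 6]                                  # input, hidden, output (illustrative)
W = [rng.normal(scale=0.1, size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]

def backprop_step(x, target, alpha=0.5):
    # Forward propagation, Eq. (7): a^(l+1) = f(W^(l) a^(l) + b^(l)).
    a = [x]
    for l in range(len(W)):
        a.append(f(W[l] @ a[l] + b[l]))
    # Output-layer residual, Eq. (8), then hidden residuals, Eq. (9).
    delta = [None] * len(W)
    delta[-1] = -(target - a[-1]) * df(a[-1])
    for l in range(len(W) - 2, -1, -1):
        delta[l] = (W[l + 1].T @ delta[l + 1]) * df(a[l + 1])
    # Partial derivatives, Eqs. (10)-(11), and gradient-descent updates, Eqs. (12)-(13).
    for l in range(len(W)):
        W[l] -= alpha * np.outer(delta[l], a[l])
        b[l] -= alpha * delta[l]
    return 0.5 * np.sum((a[-1] - target) ** 2)

x = rng.uniform(size=6)
for epoch in range(500):
    err = backprop_step(x, x)                      # auto-encoder style: target equals input
print(err)                                         # reconstruction error after fine-tuning
```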

CHAPTER 4
EXPERIMENT AND CONCLUSION
4.1 DESIGN OF EXPERIMENT
In this project, the speech emotion database is taken from the Chinese Academy of Sciences
Institute of Automation (CASIA). This database is a Chinese emotion corpus created by four
actors (2 males and 2 females) and includes six kinds of emotion statements: anger, fear,
sadness, neutral, surprise and happiness.

TABLE I. ALL KINDS OF EMOTION RECOGNITION RESULTS BASED ON
DIFFERENT FEATURES (recognition accuracy, %)

FEATURES       ANGER    FEAR     SADNESS   NEUTRAL   SURPRISE   HAPPINESS
MFCC           74.24    73.12    74.77     72.56     70.59      71.44
LPCC           73.51    71.20    70.34     75.55     67.81      68.99
PLP            81.71    80.07    80.58     82.68     76.43      75.82
DAE FEATURE    86.40    83.96    84.37     83.97     81.13      80.17

In this project, 50% of the data is used for training and the rest of the data for testing.
The voice data processing is carried out in the following manner:
a. Preprocessing of the emotional speech signal.
b. Extraction of the traditional emotional features from the speech signals, such as MFCC,
PLP and LPCC, and finally utilizing them for classification and recognition with a classifier,
namely the support vector machine.
c. Construction of the input data for the DAE using the following steps (a sketch of these
steps is given after this list):

1. Performing wavelet decomposition on each frame of the speech signal and then obtaining
its Fourier transformation.

2. Extending the data associated with each frame before and after the frame to obtain
successive super frames of the voice data.

3. Normalizing the data of the super frames, which then form the training samples for the
visible layer of the network.

In this project, the DAE consists of 5 hidden layers which learn coding and decoding
networks that are mirror-symmetrical to each other. This network structure is shown in
Fig. 3.
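
A hedged sketch of the three input-construction steps above is given below. A one-level Haar transform stands in for the wavelet decomposition actually used in the project (whose exact wavelet and depth are not specified here), and the frame length, hop size and context width are illustrative assumptions.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping short frames."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])

def haar_step(frame):
    """One-level Haar wavelet split (approximation and detail coefficients)."""
    even, odd = frame[0::2], frame[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def frame_features(frame):
    """Step 1: wavelet decomposition of the frame followed by a Fourier transform."""
    approx, detail = haar_step(frame)
    return np.abs(np.fft.rfft(np.concatenate([approx, detail])))

def super_frames(features, context=2):
    """Step 2: extend each frame with its neighbours before and after."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].ravel()
                     for i in range(len(features))])

def normalize(x):
    """Step 3: zero-mean, unit-variance normalization of the super frames."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-10)

sr = 16000
signal = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)       # stand-in for an utterance
feats = np.stack([frame_features(f) for f in frame_signal(signal, 400, 160)])
visible = normalize(super_frames(feats))
print(visible.shape)                                         # training samples for the DAE
```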

The SER process based on the DAE starts by training the visible layer and the first hidden
layer. Fifty percent of the data, the training portion, is used as the input of the deep network
after setting the connection weights and node biases. After several iterations, the final
probability values obtained for each node act as the input for the next layer, and the
connection weights of the remaining hidden layers continue to be set up in the same way until
pre-training is complete. After the layer-by-layer pre-training algorithm has set up the network
parameters, the network is fine-tuned by minimizing the reconstruction error using the
back-propagation algorithm. Once the optimization of the weights and biases is finished, an
output layer is added behind hidden layer H3. In order to minimize the classification error, the
output of hidden layer H3 is used as the approximate standard feature of the original speech
emotion, and supervised training is used to reset the parameters of the deep network. Finally,
using the optimized parameters, the learned features are fed to the SVM and the recognition
result is obtained. At the same time, a recognition result is obtained by applying the
traditional features as input to the SVM. The result obtained from the SVM on traditional
features is compared with the result obtained from the features learned by the DAE. The best
accuracies obtained for emotion recognition, produced by programs written in MATLAB, are
shown in Table 1.
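
The comparison protocol described above can be sketched as follows. The "learned" and "traditional" feature matrices below are synthetic stand-ins generated so that the learned features are cleaner; the real experiment uses DAE features and MFCC/LPCC/PLP features computed from the CASIA data, which are not reproduced here, and the use of scikit-learn is an assumption.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate(features, labels):
    """Train an SVM on the first half and test on the second half,
    mirroring the 50/50 train/test split used in this project."""
    half = len(labels) // 2
    clf = SVC(kernel="rbf").fit(features[:half], labels[:half])
    return accuracy_score(labels[half:], clf.predict(features[half:]))

# Synthetic stand-ins for six emotion classes: the "learned" DAE features are made
# less noisy than the "traditional" ones, so they should classify more accurately.
rng = np.random.default_rng(3)
labels = np.repeat(np.arange(6), 40)
centers = rng.normal(size=(6, 10))
learned = centers[labels] + 0.3 * rng.normal(size=(240, 10))
traditional = centers[labels] + 1.0 * rng.normal(size=(240, 10))

perm = rng.permutation(len(labels))                 # shuffle before the 50/50 split
labels, learned, traditional = labels[perm], learned[perm], traditional[perm]
print("traditional features:", evaluate(traditional, labels))
print("learned DAE features:", evaluate(learned, labels))
```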

4.2 Experimental Results and Analysis
The emotion recognition obtained using the DAE features showed an improvement in
recognition of 4.85%, with a highest accuracy of 86.4%, in comparison with the recognition
obtained from the traditional features. This means that the features learned by the DAE give
better emotion recognition. The other important aspect of the experiment is that the
improvement obtained by using the features learned by the DAE also holds for the speech
emotion data of one male group with respect to particular emotional statements, i.e., the
angry and happy emotions.

4.3 CONCLUSION

Speech recognition is a complex process but plays an important role in HCI. Researchers are
putting greater emphasis on understanding the different facets of speech recognition by
employing different technologies depending upon their requirements. This project emphasizes
the use of a relatively new technique, the deep auto-encoder, to learn features and recognize
emotion with greater accuracy than the available traditional features. Owing to the limited
availability of emotional voice databases and the paucity of time, my results are far from ideal
and require further improvement. In future studies, I would devote my time to using a better
emotional voice database and exploring different processing techniques on the input of the
DAE to determine better emotion features from the sample speech signals.

