
Advanced methods in data analysis

Andrei Ciceu
gr. 246

Keyword spotting using deep neural networks
Abstract

The goal of this paper is to present the design and functionality of the keyword spotting software I developed for the Advanced methods in data analysis course. As my main topic was deep neural networks and deep learning, I used these concepts in developing the software. The paper begins by describing the programming language, the tools and the libraries used to develop the software. I then present the structure of the project and the interactions between its modules, along with examples of the data used to train the deep neural network. The paper ends with the conclusions and observations gathered while developing, testing and using the software.

Introduction
The goal of this paper is to present the keyword spotting software developed for this course, which makes use of deep neural networks.
I chose this topic because I intend to use this approach in other projects outside the faculty, and because it is of high interest to me.
I first go over the libraries and tools used in developing the software. I then describe the implementation and the structure of the project, followed by a look at the training data.
The paper ends with the conclusions.

Python and PyBrain


For developing the software I chose Python 2.7, because it allows the programmer to develop software quickly, with little hassle. It is very good for prototyping and testing purposes, but it is less suitable for large or performance-critical software, since it is an interpreted language.
PyBrain [1] is a modular machine learning library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for machine learning tasks, along with a variety of predefined environments to test and compare algorithms. It offers a quick way of designing software that uses machine learning, and because it builds on the NumPy library it also has reasonably good computational times.

Feature extraction
I looked over different features that can be extracted from raw or recorded speech (WAV files), and I chose log-mel filterbanks as the features to use with the deep neural network. This approach has been shown to work [2] in previous studies, and it is also the standard representation used by other speech recognition software.
For feature extraction, the module generates acoustic features based on 40-dimensional log-mel-filterbank energies computed every 10 ms over a window of 25 ms. Contiguous frames are stacked to add sufficient left and right context. The input window is asymmetric, since each additional frame of future context adds 10 ms of latency to the system. The
DNN KWS approach uses 10 future frames and 30 frames in the past.
For this I used the pyAudioAnalysis library [3], which provides the functions needed to extract the log-mel filterbanks.
The data fed to the network was normalized between -1 and 1 to avoid overflows (which were quite common at the beginning, with PyBrain's float32 arithmetic).
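As an illustration, the following is a minimal NumPy sketch of the frame stacking and normalization described above; the array layout, the split between past and future context and the helper names are assumptions rather than the project's exact code, and the total number of stacked frames has to match the 1600 inputs (40 frames of 40 energies) expected by the network described in the next section.

import numpy as np

def stack_frames(feats, past=30, future=9):
    # feats: (num_frames, 40) log-mel filterbank energies, one row per 10 ms frame.
    # past + 1 + future frames are stacked; with past=30 and future=9 each stacked
    # vector has 40 * 40 = 1600 values (adjust the split so the total stays 40).
    num_frames, dim = feats.shape
    padded = np.pad(feats, ((past, future), (0, 0)), mode='edge')
    window = past + 1 + future
    stacked = np.empty((num_frames, window * dim))
    for j in range(num_frames):
        stacked[j] = padded[j:j + window].ravel()
    return stacked

def normalize(stacked):
    # Scale the features to [-1, 1] to avoid float32 overflows during training.
    return stacked / np.abs(stacked).max()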

The deep neural network


The deep neural network is a standard feed-forward network with the following structure:
Input: 1 linear layer of 1600 units
Hidden: 3 ReLU layers of 128 units each
Output: 1 SoftMax layer of labels + 1 units
The labels represent the keywords that can be spotted. The software I developed uses only one keyword for detection, but it can easily be adapted to more if needed. Because the last layer is a SoftMax, it returns confidence scores that sum to 1.
ReLU units have been shown to give faster training times.
To be able to resume previous training runs, the structure of the DNN is saved to an XML file after each training epoch and can be loaded again when the software starts.
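A minimal sketch of this save/load step, assuming PyBrain's XML serialization helpers (NetworkWriter and NetworkReader) are what handles the dnn.xml file:

import os
from pybrain.tools.customxml.networkwriter import NetworkWriter
from pybrain.tools.customxml.networkreader import NetworkReader

def save_network(network, path='dnn.xml'):
    # Persist the network structure and weights after an epoch of training.
    NetworkWriter.writeToFile(network, path)

def load_network(path='dnn.xml'):
    # Restore a previously trained network, or return None if none was saved yet.
    if os.path.exists(path):
        return NetworkReader.readFrom(path)
    return None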
I also tried to implement this DNN myself, but it proved to be worse in terms of performance than using PyBrain, which makes use of NumPy's fast matrix multiplication features and other tools.
This is the code used to initialize PyBrain's neural network:

#Initializing layers & network
self.__inLayer = LinearLayer(40*40)   # 40 stacked frames x 40 filterbank energies
self.__network.addInputModule(self.__inLayer)
self.__hiddenLayers = []
for i in range(0, self.__hiddenLayersCount):
    self.__hiddenLayers.append(ReluLayer(self.__neuronsPerLayer))
    self.__network.addModule(self.__hiddenLayers[i])
self.__outLayer = SoftmaxLayer(self.__labels)
self.__network.addOutputModule(self.__outLayer)

#Connecting layers
self.__connections = []
self.__connections.append(FullConnection(self.__inLayer, self.__hiddenLayers[0]))
for i in range(1, self.__hiddenLayersCount):
    self.__connections.append(FullConnection(self.__hiddenLayers[i-1], self.__hiddenLayers[i]))
self.__connections.append(FullConnection(self.__hiddenLayers[self.__hiddenLayersCount-1], self.__outLayer))
for i in range(0, len(self.__connections)):
    self.__network.addConnection(self.__connections[i])
self.__network.sortModules()
return False
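Once sortModules() has been called, the network can be evaluated on a stacked and normalized feature vector; a short usage sketch (the variable names are illustrative, not the project's own):

# network is the PyBrain network built above; features is a 1600-value
# stacked, normalized log-mel vector for one frame position.
posteriors = network.activate(features)
# The SoftMax output sums to 1: index 0 is the non-keyword label and the
# remaining entries are the confidence scores for the keywords.
keyword_score = posteriors[1]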

Training
Suppose p_ij is the neural network posterior for the i-th label and the j-th frame x_j, where i takes values 0, 1, ..., n - 1, with n the total number of labels and 0 the label for non-keyword. The weights and biases of the deep neural network, θ, are estimated by maximizing the cross-entropy training criterion over the labeled training data {x_j, i_j}.
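A small NumPy sketch of this criterion, for illustration only (the array names are assumptions): given the posteriors for a batch of frames and the per-frame labels, the criterion is the sum of the log-posteriors of the correct labels.

import numpy as np

def cross_entropy_criterion(posteriors, labels):
    # posteriors: (num_frames, n) SoftMax outputs p_ij, one row per frame j
    # labels: (num_frames,) integer label i_j per frame, 0 = non-keyword
    frame_idx = np.arange(len(labels))
    return np.sum(np.log(posteriors[frame_idx, labels]))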
For faster training on mobile devices, transfer learning (or pre-training) can be used [5]. Transfer learning refers to the situation where (some of) the network parameters are initialized with the corresponding parameters of an existing network instead of being trained from scratch.
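With PyBrain such an initialization could look like the sketch below, assuming an existing trained network of the same architecture (existing_net and new_net are illustrative names):

# Copy all weights and biases from a previously trained network with the
# same architecture; PyBrain stores them in a single flat parameter vector.
new_net._setParameters(existing_net.params.copy())
# To transfer only some layers, copy parameters connection by connection
# (each FullConnection object exposes its own params array) instead.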
The network is trained using backpropagation with gradient descent on a training set consisting of WAV files, with both positive (files containing the keyword) and negative examples.
The WAV files are standard single-channel, 16-bit integer files sampled at 44.1 kHz.
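A condensed sketch of such a training loop, assuming a hypothetical extract_features() helper that turns a WAV file into stacked, normalized frame vectors as described in the feature extraction section (the dataset layout and names are illustrative, not the project's exact code):

from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

def train_network(network, training_files, num_labels, epochs, extract_features):
    # training_files: list of (wav_path, label) pairs, e.g. read from trainingSet.txt
    dataset = SupervisedDataSet(1600, num_labels + 1)
    for path, label in training_files:
        for frame_vector in extract_features(path):
            target = [0.0] * (num_labels + 1)
            target[label] = 1.0              # one-hot target, index 0 = non-keyword
            dataset.addSample(frame_vector, target)
    trainer = BackpropTrainer(network, dataset)
    for epoch in range(epochs):
        trainer.train()                      # one epoch of backpropagation
    return network

In the actual software the network would also be written back to dnn.xml after each epoch, as described earlier.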


Comparison with other systems

Experiments have been performed on a data set with real voice search queries as the negative examples and the keywords as the positive examples, and the DNN KWS system has been compared with a similar HMM KWS system.
In terms of accuracy, the DNN KWS system outperforms the standard HMM KWS system with a 45% relative improvement, which is a substantial gain.
In terms of model size, the DNN system has only about 244K parameters, whereas the HMM system has about 373K parameters.
The DNN system outperforms the HMM one in both quiet and noisy conditions [2].

Conclusions
Training this network from a randomly initialized state proved to take a very long time. I left it to train overnight on a relatively small dataset (20 WAV files of 1 s each) for 300 epochs. The training took 10 hours to finish, and the results were not satisfying. As described in [2], the authors used a much larger dataset (20,000 audio files) and pre-trained some layers of the network using a speech recognition system.
My software has been able to recognize simple sounds (like the Romanian "e") or a clapping sound. For more complicated keywords (like "Ok Google" or "Unique"), more training time is needed.
With more processing power and more time, this software could probably be trained to accurately recognize the keywords.

Starting the program


To start the program, all required libraries must be installed (check the links in the bibliography for a list of the libraries), or a distribution that bundles them, such as Anaconda 2.7, can be installed instead.
The file trainingSet.txt contains the list of all the training data that will be used for training. If the file dnn.xml exists, the network will be loaded from it and will not be trained again.
Open a console and type python main.py to run the program. When it detects the keyword, the program announces it by printing a message to the console.
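The startup logic described above could be outlined as follows (a simplified sketch, not the exact contents of main.py; the one-path-and-label-per-line layout of trainingSet.txt, as well as build_network() and the helpers from the earlier sketches, are assumptions):

import os
from pybrain.tools.customxml.networkreader import NetworkReader

def load_or_train():
    # Reuse the saved network if dnn.xml exists, otherwise build and train a new one.
    if os.path.exists('dnn.xml'):
        return NetworkReader.readFrom('dnn.xml')
    training_files = []
    with open('trainingSet.txt') as f:
        for line in f:
            path, label = line.split()
            training_files.append((path, int(label)))
    network = build_network()   # hypothetical: builds the DNN shown earlier
    return train_network(network, training_files, num_labels=1,
                         epochs=300, extract_features=extract_features)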


Bibliography
1) http://pybrain.org/
2) Keyword spotting using deep neural networks
3) https://github.com/tyiannak/pyAudioAnalysis
4) Bengio, Y.; Courville, A.; Vincent, P. (2013). "Representation Learning: A Review and New Perspectives"
5) Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview"
6) Using DNNs on GPUs: http://devblogs.nvidia.com/parallelforall/cuda-spotlightgpu-accelerated-speech-recognition/
