
Advanced methods in data analysis

Andrei Ciceu
gr. 246

Keyword spotting using deep neural networks
Abstract

The goal of this paper is to present the design and functionality of the keyword spotting software I developed for the Advanced methods in data analysis course. As my main topic was deep neural networks and deep learning, I used these concepts in developing the software. The paper begins by describing the programming language, the tools and the libraries used to develop the software. I then present the structure of the project and the interactions between its modules, along with examples of the data used to train the deep neural network. The paper ends with the conclusions and observations gathered while developing, testing and using the software.

Introduction
The goal of this paper is to present the keyword spotting software developed for this course, which makes use of deep neural networks.
I chose this topic because I intend to use this approach in other projects outside the faculty, and because it is of high interest to me.
I first go over the libraries and tools used in developing the software. I then describe the implementation and the structure of the project, followed by a look at the training data.
The paper ends with the conclusions.

Python and PyBrain


For developing the software I chose Python 2.7, because it allows the programmer to develop software quickly, with little hassle. It is very good for prototyping and testing purposes, but it is less suitable for large or performance-critical software, since it is an interpreted language.
PyBrain [1] is a modular machine learning library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for machine learning tasks, along with a variety of predefined environments to test and compare algorithms. It offers a quick way of designing software that uses machine learning, and because it builds on the NumPy library it also has reasonably good computational times.

Feature extraction
I looked over different features that can be extracted from raw or recorded speech (WAV files), and I chose log-mel filterbanks as the features to use with the deep neural network. This approach has been shown to work [2] in previous studies, and it is also the standard representation used by other speech recognition software.
For feature extraction, the module generates acoustic features based on 40-dimensional log-mel-filterbank energies computed every 10 ms over a window of 25 ms. Contiguous frames are stacked to add sufficient left and right context. The input window is asymmetric, since each additional frame of future context adds 10 ms of latency to the system. The
DNN KWS approach uses 10 future frames and 30 frames in the past.
For this I used the pyAudioAnalysis library [3], which provides the functions needed to extract the log-mel filterbanks.
The data fed to the network was normalized between -1 and 1 to avoid overflows (which were quite common at the beginning, with PyBrain's float32 arithmetic).
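As an illustration, the following is a minimal NumPy sketch of the frame stacking and normalization described above; the array layout, the split between past and future context and the helper names are assumptions rather than the project's exact code, and the total number of stacked frames has to match the 1600 inputs (40 frames of 40 energies) expected by the network described in the next section.

import numpy as np

def stack_frames(feats, past=30, future=9):
    # feats: (num_frames, 40) log-mel filterbank energies, one row per 10 ms frame.
    # past + 1 + future frames are stacked; with past=30 and future=9 each stacked
    # vector has 40 * 40 = 1600 values (adjust the split so the total stays 40).
    num_frames, dim = feats.shape
    padded = np.pad(feats, ((past, future), (0, 0)), mode='edge')
    window = past + 1 + future
    stacked = np.empty((num_frames, window * dim))
    for j in range(num_frames):
        stacked[j] = padded[j:j + window].ravel()
    return stacked

def normalize(stacked):
    # Scale the features to [-1, 1] to avoid float32 overflows during training.
    return stacked / np.abs(stacked).max()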

The deep neural network


The deep neural network is a standard feed-forward network with the following structure:
Input: 1 linear layer of 1600 units
Hidden: 3 ReLU layers of 128 units each
Output: 1 SoftMax layer of labels + 1 units
The labels represent the keywords that can be spotted. The software I developed uses only one keyword for detection, but it can easily be adapted to more if needed. Because the last layer is a SoftMax, it returns confidence scores that sum to 1.
ReLU units have been shown to give faster training times.
To be able to resume previous training runs, the structure of the DNN is saved to an XML file after each training epoch and can be loaded again when the software starts.
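A minimal sketch of this save/load step, assuming PyBrain's XML serialization helpers (NetworkWriter and NetworkReader) are what handles the dnn.xml file:

import os
from pybrain.tools.customxml.networkwriter import NetworkWriter
from pybrain.tools.customxml.networkreader import NetworkReader

def save_network(network, path='dnn.xml'):
    # Persist the network structure and weights after an epoch of training.
    NetworkWriter.writeToFile(network, path)

def load_network(path='dnn.xml'):
    # Restore a previously trained network, or return None if none was saved yet.
    if os.path.exists(path):
        return NetworkReader.readFrom(path)
    return None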
I also tried to implement this DNN myself, but it proved to be worse in terms of performance than using PyBrain, which makes use of NumPy's fast matrix multiplication features and other tools.
This is the code used to initialize PyBrain's neural network:

#Initializing layers & network
self.__inLayer = LinearLayer(40*40)   # 40 stacked frames x 40 filterbank energies
self.__network.addInputModule(self.__inLayer)
self.__hiddenLayers = []
for i in range(0, self.__hiddenLayersCount):
    self.__hiddenLayers.append(ReluLayer(self.__neuronsPerLayer))
    self.__network.addModule(self.__hiddenLayers[i])
self.__outLayer = SoftmaxLayer(self.__labels)
self.__network.addOutputModule(self.__outLayer)

#Connecting layers
self.__connections = []
self.__connections.append(FullConnection(self.__inLayer, self.__hiddenLayers[0]))
for i in range(1, self.__hiddenLayersCount):
    self.__connections.append(FullConnection(self.__hiddenLayers[i-1], self.__hiddenLayers[i]))
self.__connections.append(FullConnection(self.__hiddenLayers[self.__hiddenLayersCount-1], self.__outLayer))
for i in range(0, len(self.__connections)):
    self.__network.addConnection(self.__connections[i])
self.__network.sortModules()
return False
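Once sortModules() has been called, the network can be evaluated on a stacked and normalized feature vector; a short usage sketch (the variable names are illustrative, not the project's own):

# network is the PyBrain network built above; features is a 1600-value
# stacked, normalized log-mel vector for one frame position.
posteriors = network.activate(features)
# The SoftMax output sums to 1: index 0 is the non-keyword label and the
# remaining entries are the confidence scores for the keywords.
keyword_score = posteriors[1]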

Training
Suppose p_ij is the neural network posterior for the i-th label and the j-th frame x_j, where i takes values 0, 1, ..., n - 1, with n the total number of labels and 0 the label for non-keyword. The weights and biases of the deep neural network, θ, are estimated by maximizing the cross-entropy training criterion over the labeled training data {x_j, i_j}.
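A small NumPy sketch of this criterion, for illustration only (the array names are assumptions): given the posteriors for a batch of frames and the per-frame labels, the criterion is the sum of the log-posteriors of the correct labels.

import numpy as np

def cross_entropy_criterion(posteriors, labels):
    # posteriors: (num_frames, n) SoftMax outputs p_ij, one row per frame j
    # labels: (num_frames,) integer label i_j per frame, 0 = non-keyword
    frame_idx = np.arange(len(labels))
    return np.sum(np.log(posteriors[frame_idx, labels]))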
For faster training on mobile devices, transfer learning (or pre-training) can be used [5]. Transfer learning refers to the situation where (some of) the network parameters are initialized with the corresponding parameters of an existing network instead of being trained from scratch.
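With PyBrain such an initialization could look like the sketch below, assuming an existing trained network of the same architecture (existing_net and new_net are illustrative names):

# Copy all weights and biases from a previously trained network with the
# same architecture; PyBrain stores them in a single flat parameter vector.
new_net._setParameters(existing_net.params.copy())
# To transfer only some layers, copy parameters connection by connection
# (each FullConnection object exposes its own params array) instead.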
The network is trained using backpropagation with gradient descent on a training set consisting of WAV files, with both positive (files containing the keyword) and negative examples.
The WAV files are standard single-channel, 16-bit integer files sampled at 44.1 kHz.
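A condensed sketch of such a training loop, assuming a hypothetical extract_features() helper that turns a WAV file into stacked, normalized frame vectors as described in the feature extraction section (the dataset layout and names are illustrative, not the project's exact code):

from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

def train_network(network, training_files, num_labels, epochs, extract_features):
    # training_files: list of (wav_path, label) pairs, e.g. read from trainingSet.txt
    dataset = SupervisedDataSet(1600, num_labels + 1)
    for path, label in training_files:
        for frame_vector in extract_features(path):
            target = [0.0] * (num_labels + 1)
            target[label] = 1.0              # one-hot target, index 0 = non-keyword
            dataset.addSample(frame_vector, target)
    trainer = BackpropTrainer(network, dataset)
    for epoch in range(epochs):
        trainer.train()                      # one epoch of backpropagation
    return network

In the actual software the network would also be written back to dnn.xml after each epoch, as described earlier.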


Comparison with other systems

Experiments have been performed on a data set with real voice search queries as the negative examples and the keywords as the positive examples, and the DNN KWS system has been compared with a similar HMM KWS system.
In terms of accuracy, the DNN KWS system outperforms the standard HMM KWS system with a 45% relative improvement, which is a substantial gain.
In terms of model size, the DNN system has only about 244K parameters, whereas the HMM system has about 373K parameters.
The DNN system outperforms the HMM one in both quiet and noisy conditions [2].

Conclusions
Training this network from a randomly initialized state proved to take a very long time. I left it to train overnight on a relatively small dataset (20 WAV files of 1 s each) for 300 epochs. The training took 10 hours to finish, and the results were not satisfying. As described in [2], the authors used a much larger dataset (20,000 audio files) and pre-trained some layers of the network using a speech recognition system.
My software has been able to recognize simple sounds (like the Romanian "e") or a clapping sound. For more complicated keywords (like "Ok Google" or "Unique"), more training time is needed.
With more processing power and more time, this software could probably be trained to accurately recognize the keywords.

Starting the program


To start the program, all required libraries must be installed (check the links in the bibliography for a list of the libraries), or a distribution that bundles them, such as Anaconda 2.7, can be installed instead.
The file trainingSet.txt contains the list of all the training data that will be used for training. If the file dnn.xml exists, the network will be loaded from it and will not be trained again.
Open a console and type python main.py to run the program. When it detects the keyword, the program announces it by printing a message to the console.
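The startup logic described above could be outlined as follows (a simplified sketch, not the exact contents of main.py; the one-path-and-label-per-line layout of trainingSet.txt, as well as build_network() and the helpers from the earlier sketches, are assumptions):

import os
from pybrain.tools.customxml.networkreader import NetworkReader

def load_or_train():
    # Reuse the saved network if dnn.xml exists, otherwise build and train a new one.
    if os.path.exists('dnn.xml'):
        return NetworkReader.readFrom('dnn.xml')
    training_files = []
    with open('trainingSet.txt') as f:
        for line in f:
            path, label = line.split()
            training_files.append((path, int(label)))
    network = build_network()   # hypothetical: builds the DNN shown earlier
    return train_network(network, training_files, num_labels=1,
                         epochs=300, extract_features=extract_features)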


Bibliography
1) http://pybrain.org/
2) Keyword spotting using deep neural networks
3) https://github.com/tyiannak/pyAudioAnalysis
4) Bengio, Y.; Courville, A.; Vincent, P. (2013). "Representation Learning: A Review and New Perspectives"
5) Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview"
6) Using DNNs on GPUs: http://devblogs.nvidia.com/parallelforall/cuda-spotlightgpu-accelerated-speech-recognition/
