
End-to-End Automatic Speech Recognition
KUNAL DHAWAN
KUMAR PRIYADARSHI


Connectionist Temporal
Classification: Labelling
Unsegmented Sequence Data
with Recurrent Neural Networks

ALEX GRAVES ET AL.; ISTITUTO DALLE MOLLE DI STUDI SULL'INTELLIGENZA ARTIFICIALE, SWITZERLAND
APPEARED IN PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON MACHINE LEARNING, PITTSBURGH, PA. YEAR: 2006
Problems with HMMs

 Require a significant amount of task-specific knowledge for training (e.g. designing state models for HMMs)
 Require explicit (and often questionable) dependency assumptions to make inference tractable (e.g. the assumption that observations are independent for HMMs)
 For standard HMMs, training is generative (they model the joint distribution), even though discriminative models generally outperform generative models in classification tasks
Solution

 Recurrent neural networks (RNNs) are powerful sequence learners, but they cannot be used directly for speech recognition because they require pre-segmented training data and post-processing to transform their outputs into label sequences
 Here comes CTC: it lets us train a recurrent neural network directly on unsegmented audio data
 A key difference between CTC and other temporal classifiers is that
CTC does not explicitly segment its input sequences. This has several
benefits, such as removing the need to locate inherently ambiguous
label boundaries (e.g. in speech or handwriting), and allowing label
predictions to be grouped together if it proves useful (e.g. if several
labels commonly occur together). In any case, determining the
segmentation is a waste of modelling effort if only the label
sequence is required
Example
 As shown in the figure, CTC simply tries to model the sequence of labels, without caring about their alignment with the audio data. This avoids ambiguous word alignments and also gives a better representation when multiple tokens occur together (the output is just a softmax over all possible labels)
But how is it done?

 We introduce an additional blank symbol alongside the labels that the recurrent neural network can output.
 This blank symbol gives the RNN the freedom to emit the label for a section of input at any moment, particularly when it is sure of its answer, simply by outputting the blank label the rest of the time.
 As the output sequence is shorter than the input sequence, there are many possible alignments with the correct label sequence. We want the recurrent neural network to learn one of these correct alignments on its own. By using dynamic programming to sum over all possible alignments, CTC provides gradients for the backpropagation phase that train the RNN to learn a good alignment.
 [CTC uses a dynamic-programming-based forward-backward algorithm, very similar to HMM training]
CTC blank symbol and collapsing
Mathematics behind CTC

 Input sequence: X = [x1, x2, …, xT]
 Corresponding output sequence: Y = [y1, y2, …, yU]
 What we want: Y* = argmax_Y p(Y|X)
 The CTC algorithm can assign a probability to any Y given an X. The key to computing this probability is how CTC thinks about alignments between inputs and outputs.
 The CTC algorithm is alignment-free: it does not require an alignment between the input and the output. However, to get the probability of an output given an input, CTC works by summing over the probabilities of all possible alignments between the two.
 One simple way to align would be to assign an output character to each input step and then collapse repeats (e.g. the alignment [c, c, a, a, t] collapses to "cat")

 This approach has two problems:
1) Often, it doesn't make sense to force every input step to align to some output. In speech recognition, for example, the input can have stretches of silence with no corresponding output.
2) We have no way to produce outputs with multiple characters in a row. Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will produce "helo" instead of "hello".
 To get around these problems, CTC introduces a new token called the blank token (ϵ)
 Now the alignments allowed by CTC are the same length as the input. CTC allows any alignment that maps to Y after merging repeats and then removing ϵ tokens (see the collapsing sketch below)
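As a rough illustration of this collapsing rule (the function name below is ours, purely for illustration), in Python:

def collapse_alignment(alignment, blank="ϵ"):
    """Map a CTC alignment to its label sequence: merge repeats, then drop blanks."""
    output = []
    previous = None
    for symbol in alignment:
        if symbol != previous:          # merge repeated symbols
            output.append(symbol)
        previous = symbol
    return [s for s in output if s != blank]  # remove the blank tokens

# Example: a blank between the two l's lets "hello" survive collapsing.
print(collapse_alignment(["h", "h", "e", "l", "ϵ", "l", "l", "o"]))  # ['h', 'e', 'l', 'l', 'o']
print(collapse_alignment(["h", "h", "e", "l", "l", "l", "o"]))       # ['h', 'e', 'l', 'o']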
The CTC alignments give us a natural way to go from the per-time-step probabilities (the softmax outputs of the network) to the probability of an output sequence.
 Thus, the CTC objective for a single (X, Y) pair is:
p(Y|X) = Σ_{A ∈ A_{X,Y}} Π_{t=1..T} p_t(a_t|X)
i.e. the sum, over all valid alignments A of Y to X, of the product of the per-time-step probabilities along that alignment.
Models trained with CTC typically use a recurrent neural network (RNN) to estimate the per-time-step probabilities p_t(a_t|X)

The CTC loss can be very expensive to compute. We could try the straightforward approach and compute the score for each alignment, summing them all up as we go. The problem is that there can be a massive number of alignments, and for most problems this would be too slow. Thankfully, we can compute the loss much faster with a dynamic programming algorithm. The key insight is that if two alignments have reached the same output at the same step, then we can merge them (a rough sketch of this forward recursion is given below).
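As a rough, unoptimised sketch of that dynamic programme (the function name and toy interface below are our own, not code from the paper), the standard CTC forward pass over the blank-extended label sequence looks like this:

import numpy as np

def ctc_neg_log_likelihood(probs, target, blank=0):
    """probs: (T, C) per-time-step softmax outputs; target: list of label indices (no blanks)."""
    # Blank-extended label sequence: ϵ, l1, ϵ, l2, ..., lU, ϵ
    ext = [blank]
    for label in target:
        ext.extend([label, blank])
    T, S = probs.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]                 # start in the leading blank...
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]             # ...or in the first label

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                # stay on the same symbol
            if s > 0:
                total += alpha[t - 1, s - 1]       # advance by one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]       # skip the blank between two different labels
            alpha[t, s] = total * probs[t, ext[s]]

    prob = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -np.log(prob)                           # the CTC loss for this (X, Y) pair

In practice the recursion is carried out with log-probabilities to avoid numerical underflow, and a matching backward pass supplies the gradients used for training.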
Implementation

 Currently available as the cost function in most end-to-end speech processing toolkits, such as EESEN.
 Multiple parallelised and fast implementations have been open-sourced, the most prominent being Baidu Research's warp-ctc.
 Conclusion: CTC performs as well as, and sometimes even better than, conventional HMM systems, and allows us to use the sequence modelling power of RNNs and LSTMs even on non-aligned speech data, making it a perfect fit for an end-to-end system.
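For illustration only, here is the kind of call such an implementation exposes; this sketch uses PyTorch's built-in nn.CTCLoss (not warp-ctc itself), and the shapes and sizes are arbitrary example values rather than anything from the slides:

import torch
import torch.nn as nn

# Example sizes: T input steps, N utterances, C labels (index 0 = blank), S target length.
T, N, C, S = 50, 4, 28, 12

log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()  # per-time-step RNN outputs
targets = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)      # label indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)  # sums over all alignments via the forward-backward algorithm
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                 # gradients w.r.t. the per-time-step log-probabilities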
Sequence Transduction with
Recurrent Neural Networks

ALEX GRAVES, UNIVERSITY OF TORONTO, CANADA


ARXIV 2012
Motivation

 Transduction: the transformation of input sequences into output sequences for various machine learning tasks, e.g. speech recognition, machine translation, etc.
 A major challenge is to make the learned input and output representations invariant to sequential distortions such as shrinking, stretching and translating.
 RNNs are capable of learning such representations, but they traditionally require a pre-defined alignment between the input and output sequences, which is itself a challenging task!
 This paper presents an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence.
Why RNNs?

 Recurrent neural networks (RNNs) are a promising architecture for general-purpose sequence transduction. The combination of a high-dimensional multivariate internal state and nonlinear state-to-state dynamics offers more expressive power than conventional sequential algorithms such as hidden Markov models.
RNN Transducer: Extension of CTC
 CTC defines a distribution over all alignments with all output sequences not longer than the input sequence (Graves et al., 2006). However, as well as precluding tasks such as text-to-speech, where the output sequence is longer than the input sequence, CTC does not model the interdependencies between the outputs!

 The transducer described in this paper extends CTC by defining a distribution over output sequences of all lengths, and by jointly modelling both input-output and output-output dependencies.

 The RNN transducer combines a CTC-like network with a separate RNN that predicts each phoneme given the previous ones, thereby yielding a jointly trained acoustic and language model.
The RNN Transducer - I
 Let x = (x1, x2, . . . , xT) be a length-T input sequence.
 Let y = (y1, y2, . . . , yU) be a length-U output sequence belonging to the set Y* of all sequences over the output space Y.
 Define the extended output space Y¯ as Y ∪ {∅}, where ∅ denotes the null output.
 Given x, the RNN transducer defines a conditional distribution Pr(a ∈ Y¯*|x).
 This distribution is then collapsed onto the following distribution over Y*: Pr(y ∈ Y*|x) = Σ_{a ∈ B⁻¹(y)} Pr(a|x), where B : Y¯* → Y* is the function that removes the null symbols from an alignment.
The RNN Transducer - II
 Two recurrent neural networks are used to determine Pr(a ∈ Y¯*|x).
 The transcription network F scans the input sequence x and outputs the sequence f = (f1, . . . , fT) of transcription vectors.
 The other network, referred to as the prediction network G, scans the output sequence y and outputs the prediction vector sequence g = (g0, g1, . . . , gU).
The Prediction Network
 The prediction network G is an LSTM network consisting of an input layer, an output layer and a single hidden layer.

 Given ŷ (the target sequence y prefixed with ∅), G computes the hidden vector sequence (h0, . . . , hU) and the prediction sequence (g0, . . . , gU) by iterating the standard LSTM equations from u = 0 to U.

 The prediction network attempts to model each element of y given the previous ones; it is therefore similar to a standard next-step-prediction RNN, only with the added option of making 'null' predictions.
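As a rough sketch only (the layer sizes, names and single-layer choice here are illustrative assumptions, not the paper's exact configuration), the prediction network can be written in PyTorch as an embedding followed by an LSTM:

import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Next-step predictor over the output labels, with index 0 reserved for the null symbol ∅."""
    def __init__(self, num_labels, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_labels + 1, embed_dim)   # +1 for the null symbol
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.output = nn.Linear(hidden_dim, num_labels + 1)

    def forward(self, y):
        # y: (batch, U) label indices; prepend ∅ so that g_0 conditions on no previous labels.
        null = torch.zeros(y.size(0), 1, dtype=torch.long, device=y.device)
        y_hat = torch.cat([null, y], dim=1)        # (batch, U + 1)
        h, _ = self.lstm(self.embed(y_hat))        # (batch, U + 1, hidden_dim)
        return self.output(h)                      # prediction vectors g_0 ... g_U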
The Transcription Network
 The transcription network F is a bidirectional RNN that scans the input
sequence x forwards and backwards with two separate hidden layers,
both of which feed forward to a single output layer. Bidirectional RNNs
are preferred because each output vector depends on the whole input
sequence (rather than on the previous inputs only, as is the case with
normal RNNs).
 The transcription network is similar to a Connectionist Temporal
Classification RNN, which also uses a null output to define a distribution
over input-output alignments.
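Again purely as an illustrative sketch (the dimensions and the LSTM choice are our assumptions, not the paper's exact configuration), the transcription network can be a bidirectional LSTM whose two directions feed a single output layer:

import torch.nn as nn

class TranscriptionNetwork(nn.Module):
    """Bidirectional encoder producing one transcription vector f_t per input frame."""
    def __init__(self, input_dim, num_labels, hidden_dim=256):
        super().__init__()
        self.birnn = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, num_labels + 1)  # forward + backward states, +1 for ∅

    def forward(self, x):
        # x: (batch, T, input_dim) acoustic feature frames
        h, _ = self.birnn(x)      # (batch, T, 2 * hidden_dim)
        return self.output(h)     # transcription vectors f_1 ... f_T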
Output Distribution
 Given the transcription vector f_t, where 1 ≤ t ≤ T, the prediction vector g_u, where 0 ≤ u ≤ U, and label k ∈ Y¯, define the output density function:
h(k, t, u) = exp(f_t^k + g_u^k)
 where superscript k denotes the k-th element of the vectors. The density can be normalised to yield the conditional output distribution:
Pr(k|t, u) = h(k, t, u) / Σ_{k′ ∈ Y¯} h(k′, t, u)
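As a minimal sketch of this computation (the variable names and sizes are ours; the joint combination is simply additive, matching the definition above):

import torch

def output_distribution(f, g):
    """f: (T, K) transcription vectors, g: (U + 1, K) prediction vectors, K = |Y| + 1 symbols incl. ∅.
    Returns Pr(k | t, u) as a (T, U + 1, K) tensor."""
    logits = f.unsqueeze(1) + g.unsqueeze(0)   # h(k, t, u) = exp(f_t^k + g_u^k), kept in log space
    return torch.softmax(logits, dim=-1)       # normalise over k ∈ Y¯

# Example with random vectors: T = 5 frames, U = 3 output labels, K = 10 symbols.
probs = output_distribution(torch.randn(5, 10), torch.randn(4, 10))
print(probs.shape)                             # torch.Size([5, 4, 10]); each (t, u) slice sums to 1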
Output Distribution - Interpretation
 Pr(k|t, u) is used to determine the transition probabilities in the lattice shown. The set of possible paths from the bottom left to the terminal node in the top right corresponds to the complete set of alignments between x and y, i.e. to the set Y¯* ∩ B⁻¹(y). Therefore all possible input-output alignments are assigned a probability, the sum of which is the total probability Pr(y|x) of the output sequence given the input sequence. Since a similar lattice could be drawn for any finite y ∈ Y*, Pr(k|t, u) defines a distribution over all possible output sequences, given a single input sequence.
Training
 Calculation of Pr(y|x) from the lattice is performed using an efficient forward-backward algorithm described in the paper (a sketch of the forward pass is given below).
 Given an input sequence x and a target sequence y*, the natural way to train the model is to minimise the log-loss L = − ln Pr(y*|x) of the target sequence. This is done by calculating the gradient of L with respect to the network weights and performing gradient descent.
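As an illustration of the forward half of that algorithm (an unoptimised sketch in probability space with our own names; real implementations work with log-probabilities), the forward variables α(t, u) from the paper can be filled in as follows:

import numpy as np

def rnnt_forward_probability(probs, y, null=0):
    """probs: (T, U + 1, K) array of Pr(k | t, u); y: target label indices of length U.
    Returns Pr(y | x) = alpha(T, U) * Pr(null | T, U)."""
    T, U = probs.shape[0], len(y)
    alpha = np.zeros((T, U + 1))
    alpha[0, 0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            total = 0.0
            if t > 0:    # reach (t, u) by emitting a null at (t - 1, u)
                total += alpha[t - 1, u] * probs[t - 1, u, null]
            if u > 0:    # reach (t, u) by emitting label y_u at (t, u - 1)
                total += alpha[t, u - 1] * probs[t, u - 1, y[u - 1]]
            alpha[t, u] = total
    return alpha[T - 1, U] * probs[T - 1, U, null]   # a final null emission completes the alignment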
Results and Conclusions
 The phoneme error rate of the transducer is among the lowest recorded on TIMIT.
 The advantage of the transducer over the CTC network on its own is relatively slight. This may be because the TIMIT transcriptions are too small a training set for the prediction network.
Thank you!
