
1.12.05
Supervisor: Dr R. Mannell.
Topic. Using Wavelets to Acoustically Analyse Oral Stops.
Richard Mullins.
INDEX
POSTSCRIPT
BACKGROUND
FORMANT TRACKING
BASELINE SYSTEMS
(1) HMM
(2) SVM
WAVELET ANALYSIS
HISTORY OF THE PROJECT
INTRODUCTION
LITERATURE REVIEW
EXPERIMENTS.
EXPERIMENT 1. Matching raw waveform.
EXPERIMENT 2. Matching raw waveform.
EXPERIMENT 3. Matching spectrum.
EXPERIMENT 4. Kalman filtering.
EXPERIMENT 5. Recognising speech using SVM.
EXPERIMENT 6. Recognising speech using H2M (a HMM system).
EXPERIMENT 7. Recognising speech using H2M with wavelet data.
EXPERIMENT 8. Testing a HMM asr system for continuous speech.
EXPERIMENT 9. Butterworth filter bank.
EXPERIMENT 10. Formant synthesis.
EXPERIMENT 11. Supplying printouts of spectrograms.
EXPERIMENT 12. Classification of phonemes from wavelet data using R.
EXPERIMENT 13. Electric field metaphor.
EXPERIMENT 14. Wavelet spectrogram from M. Johnson's toolkit.
EXPERIMENT 15. Ad hoc method to detect /s/.
EXPERIMENT 16. Method to detect stops.
EXPERIMENT 17. Using nsltools.
EXPERIMENT 18. Formant tracks.
EXPERIMENTS NOT DONE
N1. Calculating statistics.
N2. A feature for labial stops.
N3. SVD analysis of DWT data.
N4. Using wavelets to match phonemes.
N5. Grid model of vocal tract.
N6. Model of area function sequence.
REFERENCES
APPENDICES
APPENDIX 1. Comments on the project.
APPENDIX 2. Demonstration of asr using SVM.
APPENDIX 3. Matching speech.
APPENDIX 3.1 Matching raw waveform.
APPENDIX 3.2 Matching MFCC data.
APPENDIX 4. What is a wavelet?
APPENDIX 5. What is acoustic analysis?
APPENDIX 6. Examples of CWT.
POSTSCRIPT.
Only a system that does asr on untrained speech can meet our requirements for a
phonetic analysis device. It seems quite possible that existing methods such as HMM
can be used to do this. It seems desirable to have an automated system which can be
fed multiple diverse transforms of data as training input and as test input. At present
there is nothing freely available in Matlab which will do unrestricted asr.
We wish to specify a function which maps waveform to phonetic symbol. We do not
need to start from nothing, in drawing up a list of functions for processing speech.
We can draw on centuries of experience in using mathematical functions.
Gallardo-Antolin et al (2003) report good results using wavelet processing instead of
MFCC for single-word recognition. (There are many reports in the literature of using
wavelets in recognising hand-segmented phonemes). If we have a system that can
recognise single words, we can conduct experiments recognising which item in a
minimal pair was spoken, so this might be accepted as an experiment in acoustic
analysis.
As work done for the SLP813 project, an SVM (support vector machine) was used to
classify hand-labelled data from the ARCTIC database. The system appeared to give a
usable transcription (one phonetic label per frame) for a speaker used in training. For
a speaker not used in training, the transcription was quite unreadable. It is quite
possible that the results could be improved by training many SVMs, one for each
feature. No attempt was made to use wavelet data to train the SVM at this stage, as I
have continued instead to seek a better baseline system.
As evidence that other people are doing related work, one notes that Rossi (2005)
used an SVM with frames of raw wavelet data to classify two vowels from the TIMIT
database. They also did the same experiment using frames of raw speech data.
Long and Datta have identified phonemes in handsegmented data, using wavelets.
Several wavelets would need to be used to identify different classes of phones.
If a device trained on hand segmented data does not succeed in labelling hand
segmented data correctly, it is perhaps unlikely to be successful when run on
unsegmented speech. So classification experiments using handlabelled phonemes are
probably a good way to sift out ideas that will not work in asr.
For the SLP813 project, the recogniser in Becchetti's HMM asr system was tested. It
gives very good results, using only a phonetic model, on the sample data provided. I
wanted to change the system to use wavelets, but was unable to work out how to do a
training run of the existing system; unless a training run can be done, there is no point
in preparing data for one.

For the SLP813 project, a single-word utterance asr system using HMM was obtained.
(This is a demonstration program in Cappe's H2M system). I modified this to use
wavelet data, and the results were not as good as those of the original system, which
used cepstral data. This result is consistent with Long and Datta's finding that we need to
use several wavelets, because a single wavelet is not suitable for recognising all
phonemes. A suggested improvement would be training several HMMs (perhaps
using different wavelets) and polling the results. But perhaps a simpler improvement
would be to train separate HMMs, each to recognise a particular feature.
Dan Ellis has a DTW (dynamic time warping) system which uses STFT frames. For
the SLP813 project, this was tested on very noisy speech. It had some success in
recognising syllables, but not in recognising pieces of speech of, say, 0.7 s. For this
reason, I did not continue with this system to adapt it to use wavelet frame data.
However, this experiment could be redone using clearer speech.
This report is not concerned with the details of stop data, for example that such and
such a stop has such and such an average voice onset time. Hand labelled data has
been used for training. This does not mean that it is irrelevant to be concerned with
the phonetics of stops in a language, only that this was not an issue in the experiments
I was doing, because I was trying to use automated methods (which would do the
analysis whatever accent or language they were presented with).
It would have been useful and interesting to build a voice coach which could
compare one's voice to targets for each phoneme. The above SVM system could be
used to give this feedback, by displaying on the screen the system's transcription of
one's speech as one was speaking. This path has not been developed because I have
limited myself this year to using Matlab, and without the Data Acquisition toolbox,
the microphone freezes in Matlab, making it impossible to do interactive experiments
using the microphone.
Niyogi and Sondhi (2001) have written on the design of a filter which recognises a stop in
continuous speech. They obtain three coefficients at a 1000 Hz rate: (i) energy, (ii) energy
above 3 kHz, (iii) Wiener entropy, which is the integral of the log of the spectrum minus
the log of the integral of the spectrum. Calculations are made over a 5 ms window, and
Thomson's multi-taper method is used for computing the spectrum. They obtained a
16% error rate for detecting stops in TIMIT, with a 33-coefficient linear filter which
was trained on 4 speakers using 10 sentences each from TIMIT. The detector provides
an output (1 or 0) every frame, i.e. every millisecond. They say that this type of
approach could be used for recognition of other features, e.g. nasality.
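A hedged Matlab sketch of these three coefficients, computed over 5 ms windows at a 1 ms hop. The file name is hypothetical, and a plain FFT periodogram stands in for Thomson's multi-taper spectrum, so this only approximates Niyogi and Sondhi's processing.

% Per-frame energy, energy above 3 kHz, and Wiener entropy (log form).
% A plain FFT power spectrum stands in for the multi-taper estimate.
[x, fs] = wavread('utterance.wav');           % hypothetical input file
winlen  = round(0.005 * fs);                  % 5 ms analysis window
hop     = round(0.001 * fs);                  % one output frame per millisecond
nframes = floor((length(x) - winlen) / hop) + 1;
feat = zeros(nframes, 3);
for k = 1:nframes
    seg = x((k-1)*hop + (1:winlen));
    S   = abs(fft(seg, 256)).^2;              % power spectrum
    S   = S(1:128) + eps;                     % positive frequencies, avoid log(0)
    f   = (0:127) * fs / 256;                 % bin frequencies in Hz
    feat(k,1) = sum(S);                       % (i)  total energy
    feat(k,2) = sum(S(f > 3000));             % (ii) energy above 3 kHz
    feat(k,3) = mean(log(S)) - log(mean(S));  % (iii) Wiener entropy
end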
From the point of view of automation, it would be of interest to know whether the
processing method obtained by Niyogi and Sondhi could be derived automatically.
P. Niyogi and C. Burges (2004) use an SVM to detect stops using the same channels
of energy and Wiener entropy data used by Niyogi and Sondhi. This means that they
have automated part of the processing, because the SVM is handling the step which
previously required them to devise a method to design the filter. To take
this idea further, would it be possible to obtain results without having to decide what
type of data to present to the SVM? The only suggestion I can make is that by
presenting a sufficiently diverse variety of transformed data, the system may be able
to take the coefficients it needs. As reported briefly later in this report, I obtained the

nsltools system (Centre for Auditory and Acoustic Research). This converts waveform
to a 4-d collection of wavelet data. It is possible that by using all this as raw data, we
would have the right training data we need. For the SLP813 project, no attempt was
made to train an SVM with data obtained from nsltools processing.
Rossi and Villa (2005) used an SVM to classify two vowels, using the raw data (256
points) and wavelet data (256 points). They do not say what kind of wavelet
processing was used, and maybe it does not matter. If so, this would suggest that the SVM is
superior to the classification method using wavelet data reported by Long and
Datta, where different wavelets were needed depending on the type of phoneme.
For the SLP813 project, an SVM has not been run with wavelet data. Experiments
similar to Rossi and Villa's could have been repeated, but this was not an objective of
the project. I had much more demanding requirements, because I was attempting to
use an SVM to decode unsegmented, untrained speech. Results were unsuccessful,
using 11 MFCC coefficients. I also attempted to train an SVM to match a single
feature; this was even less successful, as it could not recognise even the speech used
in training. That experiment should be repeated and reviewed to understand why it
did not work.
O. Schwartz and E. Simoncelli (2001) weight the output of each filter in a filter bank,
by dividing by the weighted sum of the rectified response of other filters. They choose
the weights to maximize the independence of the normalized responses for an
ensemble of natural sounds. This idea could of course be applied to a wavelet filter
bank.
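As a rough illustration, the sketch below applies that kind of divisive normalisation to a matrix of rectified filter-bank (or wavelet-channel) outputs. The function name is mine, and the weight matrix W is assumed to have been learned separately, as in Schwartz and Simoncelli's work.

% Divisive normalisation of filter-bank outputs (sketch).
% E is channels x frames of rectified responses; W(i,j) is the weight of
% channel j in the normalising pool for channel i (assumed pre-learned).
function En = divisive_normalise(E, W, sigma)
    pool = W * E;                 % weighted sum of the other channels, per frame
    En   = E ./ (sigma + pool);   % each response divided by its pool
end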
M. Blomberg and K. Elenius (1981) used zero crossing rates in bands to distinguish
vowels. We can take wavelet data to be an approximation to derivatives, and hence it
might be expected to be able to give similar results to zero crossing data.
BACKGROUND.
STFT (short time Fourier transform) is a very widely used method. It is equivalent to
a filter bank method. The wavelet transform is also equivalent to a filter bank method.
(For efficiency, wavelet transforms are usually not implemented directly as a filter
bank, but as a cascade of filters running at different rates, with lower frequencies
processed at a slower rate. For simplicity in this discussion, a wavelet transform may
be thought of as a filter bank.) Vidakovic has code on his web site which implements
the DWT (discrete wavelet transform) as a matrix multiplication. We could use this
matrix as a filter bank by applying it to the waveform under a stepped window, giving
us a frame of data for each multiplication.
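A minimal sketch of this idea, using a single-level Haar analysis matrix rather than Vidakovic's own code (which is not reproduced here):

% Single-level Haar analysis matrix applied under a stepped window.
% Each multiplication yields one frame: N/2 averages followed by N/2 details.
N = 8;                                        % window length in samples
H = zeros(N, N);
for i = 1:N/2
    H(i,     2*i-1:2*i) = [1  1] / sqrt(2);   % averaging (lowpass) rows
    H(N/2+i, 2*i-1:2*i) = [1 -1] / sqrt(2);   % differencing (highpass) rows
end
x = randn(1, 64);                             % stand-in waveform
nframes = floor(length(x) / N);
frames = zeros(N, nframes);
for k = 1:nframes
    frames(:, k) = H * x((k-1)*N + (1:N))';   % one frame per multiplication
end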
DWT (discrete wavelet transform) uses only filters that are obtained from a single
"mother wavelet" by scaling and shifting. WPT (the wavelet packet transform) also
decomposes the detail branches, giving a library of bases; a "best basis" search then
lets us select the basis that gives the best fit to the data (i.e. can describe it to a
specified accuracy with fewer coefficients).
Beng Tan et al (1996) report that sampled continuous wavelet transform gave slightly
better results than MFCC coefficients, in phoneme recognition experiments.

Haar and Young used speech data as the original mother wavelets in using wavelet
transforms to extract features from stops. The processing showed little benefit.
It would seem to be an extension of this idea to allow the filters in a wavelet
transform to be time varying. This would, for example, allow a cascade of filters in a
formant speech synthesiser filter bank to be described as a wavelet transform. This
suggestion has not been seen in the literature. But "predict" and "adjust" steps are
already used in Sweldens' methods for building wavelet filter banks, so it would not
seem to be a huge step to allow adjustment to the filters. A reason for wanting to use
time varying filters is that it is recognised (e.g. by Long and Datta) that different filters
give better results for analysing different phonemes, so it would be reasonable to
extend this to using time varying filters to analyse (and generate) speech.
Kingsbury has written extensively on wavelets. In http://cnx.rice.edu/content/m11138/latest/
he says "The word wavelet refers to the impulse response of the cascade of filters
which leads to a given bandpass output." This could be interpreted to say that when
we have a cascade of filters, we have wavelets. If so, this would imply that we have
wavelets when we have a formant synthesiser which contains a cascade of filters. This
is at least some support for my suggestion above that such a cascade of filters may
come to be called a wavelet transform.
FORMANT TRACKING.
Speech has a time varying spectrum. It seems very reasonable to attempt to model the
time varying spectrum in terms of components. One classic model is that the speech is
a convolution of a source and a filter. Analysis by synthesis experiments should be a
way to learn about the features in speech. The Fortran code used in M. Wagner's
formant synthesiser was used to synthesise some speech. Analysis by synthesis would
be a way of automatically tracking formants, and would give insight into the analysis
of speech. But adapting the code to do analysis by synthesis has not been done for the
SLP813 project.
STFT analyses speech as sinusoids of fixed frequency. From this we can infer formant
tracks. Kamran Mustafa (2003) has written on formant estimation using an adaptive
filter bank (i.e. time varying filters). Mustafa's paper does not contain the word
"wavelet", so it is unlikely that he uses wavelets in his method. Ridsdill has written on
how to implement time varying filters using wavelets. Ridsdill works in the area of
geology. It seems desirable to implement Ridsdill's ideas with a view to modelling
phonemes using time varying filters. (Of course, it is also possible to model phones
using a non-time-varying filter bank.)
Chirplets have been used by O'Neill to decompose a waveform into chirps. We can
accept this as a wavelet method, even if some people say that, strictly, chirplets are
not wavelets. His code is on the Matlab file exchange web site, but the code does not
work on the current version of Matlab. If this code had worked, it might have
provided better views of the formant transitions at the onset and release of stops.
Because chirplets are time varying frequencies, they may highlight time varying
formant data better than STFT (or an arbitrary choice of wavelet processing) can.

BASELINE SYSTEMS.
I have attempted to find a baseline system which does not use wavelet methods.
This could then be adapted to use wavelets, or at least used to give results that could
be compared with what was obtainable using a wavelet method.
(1) HMM
The HMM system called H2M, by Cappe, contains a demonstration program for
single word asr. Data from the hand-labelled SHATR database on the internet was
extracted and used in a run of the program.
The program was then adapted so that it used CWT (continuous wavelet transform)
coefficients instead of cepstral coefficients. The frame coefficients used were the total
energy over the frame in each channel of CWT data. The results were not as good as
when cepstral data was used.
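The kind of frame feature used is sketched below, assuming the Wavelet Toolbox cwt function is available; the file name, scales and frame length are illustrative rather than the exact values used in the experiment.

% Frame features: total energy per CWT channel (sketch).
[x, fs] = wavread('word.wav');          % hypothetical single-word utterance
scales  = 2.^(1:0.25:6);                % a few channels per octave (illustrative)
C       = cwt(x, scales, 'morl');       % channels x samples (Wavelet Toolbox)
winlen  = 256;                          % frame length in samples
nframes = floor(size(C,2) / winlen);
feat    = zeros(nframes, length(scales));
for k = 1:nframes
    seg = C(:, (k-1)*winlen + (1:winlen));
    feat(k, :) = sum(seg.^2, 2)';       % total energy per channel in this frame
end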
It would be possible to continue with this system and refine it, by experimenting with
various statistics derived from the CWT data, and by experimenting with different
wavelet transforms. It may not even be possible to find a particular wavelet transform
which will give better results than the cepstral data. But this does not prove that
wavelet processing cannot be used in this experiment. All it means is that this type of
simple minded approach is ineffective. It would also be possible to extend these
experiments by training separate HMMs for features based on classes of phonemes,
e.g. training an HMM to recognise stops versus non-stops. Kirchhoff has implemented
an HMM system based on features which gave good results.
Becchetti's asr system uses HMM. The demonstration runs from exe files. It was
verified that the programs used in the demonstration could be compiled from source
code and run. The system recognised the sample data perfectly or almost perfectly,
using only a phonetic model. However, the results using my own data were very
poor. I was unable to work out how to do a training run for Becchetti's system. Without
knowing how to do this, there was no point in trying to find out how to adapt the data
processing to use wavelet coefficients. With more time, no doubt a solution could be
found to getting the system to work on new utterances, and to testing wavelet data for
the HMM.
Attempts to work with the HMM systems HTK and CSLU did not succeed. In
neither case was I able to complete the tutorials. One of the HTK tutorials is extremely
thorough, and consists of hundreds of pages, but I did not have the speech data to go
with it, and solving this problem alone put it beyond the scope of the project.
(2) SVM
In the absence of a continuous speech HMM system, an SVM was obtained from the
net and trained using 1000 hand-labelled utterances from the ARCTIC database. The
output from the system consists of phonetic labels for frame data.

This is an example of what was obtained (from part of the utterance arctic_b0501.wav
in ARCTIC). The speaker was used in training, but the utterance was not used in
training. The words "I said" were decoded as

ih ay ay ay ae ae ay eh eh eh iy
s s s s s s s s s
eh eh eh eh eh eh eh eh ih
d

using labels from the alphabet that ARCTIC used.
The result looks good, and from this it is believable that an automated method could
be devised to convert frame transcriptions to an ordinary phonetic transcription.
However, the system was quite unsuccessful in obtaining a readable output from a
speaker who was not used in training. It is possible, of course, that we could find a
way to convert the unreadable transcription to something readable. But no such way
has been found.
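One such automated method might simply collapse runs of identical frame labels, dropping very short runs as probable errors. A minimal Matlab sketch (the function name and threshold are mine, not part of the system described above):

% Collapse per-frame labels into a phone-level transcription by merging
% runs of identical labels and dropping very short runs.
function phones = collapse_frames(framelabs, minrun)
    % framelabs: cell array of label strings, one per frame
    % minrun:    shortest run (in frames) accepted as a real phone
    phones = {};
    k = 1;
    while k <= length(framelabs)
        runlen = 1;
        while k + runlen <= length(framelabs) && ...
              strcmp(framelabs{k + runlen}, framelabs{k})
            runlen = runlen + 1;
        end
        if runlen >= minrun
            phones{end+1} = framelabs{k};
        end
        k = k + runlen;
    end
end

For example, collapse_frames({'ih','ay','ay','s','s','s'}, 2) returns {'ay','s'}.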
A more promising approach to decoding the new utterance might be to use many
codebooks. A suggestion is that we could attempt to train a codebook to recognise a
vowel / non-vowel distinction. We could have another codebook to recognise
"nasal". Another possibility is to train a codebook using vowel data only (by training
it only on frames that were vowels in the hand labelling) and use this to transcribe
speech. If preliminary results are successful, we might train SVMs for many different
features. These experiments have not been done.
Until they are done, I cannot proceed with the SVM. I see no point in arbitrarily
deciding to train an SVM using wavelet data unless we have a working baseline
system. For the baseline system to be useful for my purposes, it is highly desirable
that it be able to recognise speech from new speakers.
Another approach is that of F. Rossi and N. Villa (2005), Classification in Hilbert
Spaces with Support Vector Machines. They use an SVM with wavelet data, to
classify /aa/ and /ao/ examples in the TIMIT database.
WAVELET ANALYSIS.
Wavelet analysis can be described as smoothing a signal, and then iteratively
smoothing the smoothed signal to obtain a sequence of N smoothed signals. We can
also build a denser and more redundant collection by also smoothing the residual
signals, to obtain 2^N - 1 new signals. If the smoothing is the mean of two adjacent
samples, we have the Haar wavelet transform. If some other finite difference formula
is used to compute the smoothed signal, we get some other wavelet transform.
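A minimal sketch of this description for the Haar case, where each level's smoothed signal is the mean of adjacent sample pairs and the residual is half their difference:

% Multi-level Haar analysis by repeated pairwise smoothing (sketch).
x = randn(1, 256);                 % stand-in signal, length a power of two
levels = 4;
smoothed  = cell(1, levels);
residuals = cell(1, levels);
s = x;
for j = 1:levels
    a = s(1:2:end);                % odd-indexed samples
    b = s(2:2:end);                % even-indexed samples
    s = (a + b) / 2;               % smoothing: mean of adjacent samples
    smoothed{j}  = s;
    residuals{j} = (a - b) / 2;    % what the smoothing discarded
end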
Some writers have said that wavelets are equivalent to finite differences. This is a
big statement, because it is hard to restrict finite differences to anything less than the
whole of mathematical analysis in a form suitable for computing. The theory of finite
differences can be based on Taylor's theorem, which requires that functions be
infinitely differentiable.

The fact that we can get wavelets out of smoothing a signal suggests that there are
wavelets, whether we know it or not, and perhaps whether we like it or not, in any
system that has a feedback loop, for example a Kalman filter, or any system that uses
an approximation method like Newton's method. This is why I think it will come to
be recognised that a formant synthesiser is in fact an example of a wavelet transform.
Sweldens showed how to construct wavelets using "predict" and "adjust" steps. This,
too, seems a very general approach, and suggests to me that wavelets are going to
appear wherever we have predict and correct steps, such as in Newton's method of
approximation, or in a signal processing module with a feedback loop.
Long and Datta have said that different wavelets are useful in analysing different
classes of phonemes. The type of analysis they suggest can be done by searching a
library of wavelets for the best match. There is code relevant to this in Matlab, but the
functions do not work in my student version.
Following Long and Datta, we could process speech using a number of different
wavelets. Let us consider, for example, processing speech using 64 different wavelets,
obtaining 64 coefficients from each wavelet. We could then process all this data
(2^12 items per frame) using SVD (singular value decomposition).
Perhaps we can think of the SVM (support vector machine) as an ad hoc approach that is
similar to this. It may be that an SVM gives similar results to what we would get
by using wavelet data. Some training methods require exponential amounts of data
in order to be run successfully with many coefficients of training data, but it is said
that SVM is less subject to this restriction. I have used only between 11
and 33 coefficients in my experiments. But Rossi and Villa, above, used all 256 wavelet
coefficients in their frame data for an SVM experiment.
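A sketch of that processing chain, scaled down from the 64-wavelet example above; the wavelet names, scales and file name are illustrative, and this was not run as part of the project.

% Stack CWT frame features from several wavelets and reduce with SVD (sketch).
wavelets = {'morl', 'mexh', 'gaus2'};   % illustrative choice of mother wavelets
scales   = 2.^(1:0.5:6);
winlen   = 256;
[x, fs]  = wavread('utterance.wav');    % hypothetical input
nframes  = floor(length(x) / winlen);
X = [];
for w = 1:length(wavelets)
    C  = cwt(x, scales, wavelets{w});
    Fw = zeros(nframes, length(scales));
    for k = 1:nframes
        seg = C(:, (k-1)*winlen + (1:winlen));
        Fw(k, :) = sum(seg.^2, 2)';     % per-channel energy for this wavelet
    end
    X = [X, Fw];                        % frames x (channels * wavelets)
end
[U, S, V] = svd(X, 'econ');
Xred = X * V(:, 1:10);                  % keep the 10 strongest components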
There is a connection between the correlation coefficient and wavelet analysis. Beng Tan
(1996): "Wavelet coefficients are obtained by computing the correlation between each
wavelet and the signal." A single level wavelet analysis produces an output
consisting of a sequence of numbers. Each number is the dot product of the analysing
wavelet and the waveform under the stepped window. A correlation coefficient is
similar to a dot product, but the length is normalised; it is equal to the cosine of the
angle between the vectors in n-space.
Let us consider frames of data, whether obtained by wavelet transform, or some other
method such as STFT. Let us arbitrarily consider a track which consists of a frame
coefficient for a number of frames, say 10 frames. We could consider the correlation
coefficient between this track, and other tracks, which could be lagged versions of the
same track, or lagged or non-lagged tracks of other coefficients. Empirically, we could
find that, for certain features, some pairs of tracks have high correlation coefficients.
Processing all this data could be a way to recognise features in speech.
This means that even without using the term wavelets, we have a method similar to
wavelet analysis when we consider correlation coefficients. (The output of a single
level wavelet analysis is similar to taking a piece of waveform, such as the sequence
of coefficient values I called a track above, and computing its dot product with
another waveform under a stepped window.)
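A small sketch of the track correlation just described, for frame data F (frames x coefficients); the function name and arguments are mine.

% Correlation between two coefficient "tracks" at a given lag (sketch).
function r = track_corr(F, i, j, lag, start, len)
    a = F(start : start+len-1, i);            % track of coefficient i
    b = F(start+lag : start+lag+len-1, j);    % lagged track of coefficient j
    a = a - mean(a);
    b = b - mean(b);
    r = (a' * b) / (norm(a) * norm(b));       % cos of the angle between the tracks
end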

We see from the above that we can generate more and more complex experiments. It is
possible that as experiments become more complex, running articulatory models will
become an attractive alternative; at present these could be too expensive to run, but
computer costs will continue to decrease.
HISTORY OF THE PROJECT.
It was suggested to me in 2002 that one needed to first find a topic to work on, and
then to look for tools. However, without knowing what tools are available, how can
one know what a suitable topic is? I selected the wavelet topic arbitrarily, as only two
topics were available at that time in the acoustics area and we were strongly advised
(at least on one of the Macquarie web documents) not to make up our own topic.
It would be possible to interpret the topic as developing an articulatory synthesiser
using wavelet methods to solve the mathematical equations for a model of speech or
stops in particular. This counts as acoustic analysis because it is unmistakably
acoustics as the term is understood in applied mathematics. This interpretation,
because of the amount of work required, would have precluded any examination of
any other interpretations of the topic.
However, this project has taken a broader interpretation of the topic.
In 2002, VB6 was used to build a filter bank using low order Butterworth filters. For
comparison, Evan Ruzanski (2003) presented perfect reconstruction filter banks using
fifth order Butterworth filters. Selesnick gave a unified treatment of Butterworth
filters and Daubechies wavelets. Butterworth filters were chosen because this appears
to be the method used in the formant synthesiser written by M. Wagner in 1976. The
filter output was squared and smoothed. This obtains smoothed spectrogram data.
(A plot of smoothed spectra abstracted from the smoothed spectrogram, is given later
in this report).
Many of the wavelet functions for R, available on the net, were tested. I also did some
experiments using java.
In 2005 I have downloaded Matlab speech processing functions from the net, and also
wavelet functions. Many Matlab programs do not work. In some cases, this is
believed to be due to bugs in the student edition, which is unsupported by the supplier;
in other cases, it is due to incompatibilities with previous versions of Matlab.
A more serious approach to the topic probably indicates using the full version of
Matlab.
In hindsight, the best approach to the topic could have been to display wavelet
spectrograms, and display statistics derived from them, as a display in a voice coach
system, e.g. for someone wanting to learn to pronounce a new language. The current
tools being used did not facilitate interactive use of the microphone (it freezes) and as
student Matlab is not supported, the only known way of continuing would have been
to buy the Data Acquisition toolbox which presumably resolves the problems.
However, it is too late now to develop this approach. Without interactive use of the
microphone, I was not encouraged to develop a voice coach. From the user's point of
view, immediate feedback is essential, so they can see results while they are still
speaking or immediately after they have spoken.
INTRODUCTION.
The concept of wavelets emerged in the second half of the 1900s. The mathematics to
describe them is what was taught as analysis or advanced calculus. The word was
used in geophysics in the 1940s. Similar ideas were known in quantum mechanics.
I studied generalised harmonic analysis in 1964, with B. Ninham at UNSW. I can't
remember if the word "wavelet" was used; it does not appear in the book by
Lighthill which we used.
Fourier analysis is used in applied mathematics. Gabor introduced the windowed
Fourier transform in the 1940s. With the general availability of computers from the
1960s, the Fourier transform was widely used. The STFT (short time Fourier
transform) obtains a spectrogram as a sequence of spectra, by doing the Fourier
transform over a stepped window.
The Fourier transform uses an orthogonal basis of sinusoidal analysis functions. Other
orthogonal bases are known; for example, Chebyshev studied orthogonal basis
functions in the 1850s.
Mexican hat wavelets were used by Marr (1980) to develop a theory of vision
processing. In 1990, Ray Kurzweil estimated that it could take two billion PCs to
duplicate the processing of the human visual system. He based his estimate on the
assumption that the eye uses Mexican hat wavelets in its processing. He believed that
by the year 2037 a single PC would have the power of two billion PCs of the year
1990. From Kurzweil's estimate, millions of today's PCs would be needed to
duplicate the processing of the human visual system. We might guess, then, that many
thousands of PCs, at least, would be needed to duplicate the processing of the human
hearing and speech system.
The wavelet transform is equivalent to a filter bank, as is STFT. The wavelet
transform analyses higher frequencies over a shorter time window, so they can show
detail that is not evident in a given STFT, say with a window of .01s. But this of
course does not mean that information is missing from the STFT, it is just that it is
coded in a convolution form.
For the purposes of this report, speech can be represented as a 1-d waveform. For
example, Emu data is 16 bits at 20KHz. The Fourier transform analyses a waveform
by decomposing it into a sum of complex sinusoids. The STFT (short time Fourier
transform) analyses a waveform by dividing it into (perhaps overlapping) pieces and
using the Fourier transform to analyse each piece. The wavelet transform is similar to
STFT, but the window length varies in proportion to the wavelength of the analysis
function.
With STFT, we have a constant windowing function (e.g. a Gaussian) and a number
of different analysis functions (complex sinusoids). With the wavelet transform, the
windows can be of different widths (depending on which channel we are analysing,
in the filter bank model of wavelet processing). The wavelet transform does not need
a separate windowing function, because we can think of the windowing function as
being part of the analysis function, which is something very roughly like a damped
sinusoid. Wavelet transforms are part of a larger field, consisting of cascades of FIR
and IIR filters generally. An even larger field consists of maps generally, which would
include arbitrary neural networks.
The spectrogram itself can be considered to be a form of analysis of the waveform.
(Indeed, displaying the raw waveform itself is a form of speech analysis). This is
readily obtainable, e.g. using Matlab.
Huang (2001) does not list "wavelet" in the index. However, wavelets are mentioned
on p 259, under Modulated Lapped Transforms. They discuss a filter bank using
orthogonal filters which are sine-modulated complex exponentials, which have a
property typical of functions called wavelets, i.e. they can be obtained from each other
by stretching by 2 and scaling appropriately. This means that point n in wavelet
number N is equal to point 2n in wavelet number 2N, except for a factor of sqrt(2).
The octave is such an important feature of music that it seems reasonable to think it
would play a part in speech. This is one reason why we might think it is useful to
consider an octave decomposition such as is given by the DWT.
In contrast to this, the Fourier transform analyses a waveform as a sum of sinusoids
whose frequencies are integral multiples of a base frequency. However, there is literature
on running the Fourier transform with a warped frequency scale. This would allow us
to obtain an octave decomposition using the Fourier transform.
Many other functions have the property that we can construct an orthogonal filter
bank where the functions are octave scaled versions of a single function. Since the
1980s, functions with this property have become known as wavelets. The Haar
function (a square wave pulse) and the Whittaker sinc function were both discovered
in the early 1900s. Other wavelets, including Daubechies wavelets, were found in the
1980s and later.
In the 1970s it became more widely known that behaviour that looked very complex
could sometimes be produced by a fairly simply described mathematical method.
This was a theory of fractals. It is applicable today to such things as describing
texture in a 2-d image. I know of no attempts to describe the texture of speech
spectrogram data. Some wavelets (e.g. Daubechies 4) are fractals. It is likely that
already wavelets are being used to describe the texture of 2-d data. This would be
relevant to speech analysis if we could show that the texture of a spectrogram is
relevant to how it sounds. Intuitively, it would appear to be relevant. In some wavelet
spectrograms of speech, there is a pattern of holes in the spectrogram associated with
voiced speech; the distance between the holes, in both x and y directions, is related
to F0.
There are many ways to analyse speech. It suffices, for this project, to use single
channel waveform data. Two very common forms of analysis are LPC and STFT.
STFT is similar to a wavelet transform, in that both are equivalent to a filter bank.
LPC uses a time-varying IIR filter. Another analysis is formant data: the frequency
and bandwidth of formants. Formant data can be derived from STFT or LPC data.


Recently, extended Kalman filtering has been used on speech. This is a matrix
method. LPC, STFT, and wavelet transforms are also able to be represented as matrix
methods.
Other analyses are possible. We could operate a formant synthesiser model to do
analysis by synthesis to discover formant data.
It is also notable that in data communications (as distinct from speech analysis), more
and more complex codings are being developed. As a simple example of something
that has appeared within the last 20 years, an orthogonal basis of wavelets can be used
to code information in a waveform. In data communications, even in the 1970s there
was a standard data transmission method which used Galois field theory for an
error-correcting convolution code.
One type of analysis that could be looked at is using time varying filters to analyse
speech. Kamran Mustafa (2003) does formant tracking using time varying filters. This
area may be already covered by the concept of analysis by synthesis, operating a
formant synthesis model, because in a formant synthesis model one has time
varying filters which are specified by the formant trajectory data.
Picking an arbitrary wavelet method and using it to process the data for a
classification experiment could be a starting point for working on the topic of the
project. Long and Datta discuss using different wavelet transforms to analyse
different phones. A wavelet transform is an orthogonal basis method. This makes
connections with other orthogonal basis methods: one writer uses Kalman filtering,
which tracks a time varying orthogonal basis, to automatically analyse unlabelled
speech. It seems impractical to attempt the method proposed by Long and Datta
unless we have information about what wavelets to use with which phones; without
this information we will run out of time with the project incomplete. The best basis
method provides an automated way to fit wavelets to data, but the code for this in
my student version of Matlab does not work. In the absence of a platform from which
to carry out an investigation of what wavelets to use to analyse which phones, an
easier-to-automate approach may be to use SVD to model phones. In fact, it is said
that the wavelet-vaguelette and WPT (wavelet packet transform) methods
approximate SVD.
For the SLP813 project, no extensive experiments were done using SVD to model
phones, but a related experiment was done. MFCC data calculated on 1000 utterances
from the ARCTIC database was used to train a Kalman filter. Separate Kalman filters
were trained for each phone, and then speech synthesis was attempted by taking as
input a sequence of phoneme symbols and switching to each filter model in turn after
chosen time intervals. This gave a result of extremely poor quality. If the
experiment had worked better, I would have tried to work out how to use the system
to decode speech. This experiment could be repeated with more MFCC coefficients,
or with other coefficients. There is some literature on automated processing of
unlabelled speech using the Kalman filter.
What is needed is a theory of how to design a collection of time varying filters which
are effective in recognising speech. We can take valuable ideas for this from writings
on the design of perfect reconstruction filter banks using wavelets. We can also develop
this concept by adapting a system which does analysis by synthesis using a formant
synthesiser.
Let us try to place wavelet processing within vector space theory. If we consider a
vector space, it is clear that all orthogonal bases (in the same space) can be derived by
a rotation of axes. Therefore any wavelet transform (in a given space) can be obtained
from another wavelet transform in the same space, by a rotation in n-space.
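A tiny Matlab check of that claim, relating two random orthonormal bases by an orthogonal matrix (a rotation, possibly combined with a reflection):

% Two orthonormal bases of R^n are related by an orthogonal matrix (sketch).
n = 8;
[Q1, R1] = qr(randn(n));       % columns of Q1: an arbitrary orthonormal basis
[Q2, R2] = qr(randn(n));       % another one
R = Q2 * Q1';                  % the map taking basis 1 to basis 2
disp(norm(R * Q1 - Q2));       % ~0: R carries one basis onto the other
disp(norm(R' * R - eye(n)));   % ~0: R is orthogonal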
Matlab provides access to some examples. However there is no general purpose asr
system freely available in Matlab. The fact that there is nothing ready made and freely
available in Matlab for general asr should caution us against too high expectations for
this project.
In speech analysis we are looking at the general area of maps, in that ultimately we
want to discover a map that converts waveform to a sequence of phoneme symbols.
These are examples of STFT and CWT obtained by running K. Johnson's Matlab
package (C:\policy\sm\smtoolbox\demo403.m) on the start of the utterance "the price
range" from the Emu collection.

[Figure: STFT spectrogram of the start of "the price range".]

[Figure: CWT spectrogram of the same material.]

I. Christov (2004), Multiscale Image Edge Detection, discusses the use of wavelets for
edge detection. This approach could be used for looking at spectrograms. In 1992
Mallat and Hwang generalised the Canny approach, using wavelets for singularity and
edge detection in signals. Edge detection is part of the general area of pattern
recognition in images; obviously this is an approach that could be used in speech
recognition.
Quite complex coding methods for data communications have been developed over
the past 30 years. Even in the 1970s, one of the standard methods used convolution
coding over a Galois field as an error-correcting code. Since the 1970s more and more
complex coding methods have been developed. Some of these coding methods have
used orthogonal functions. The wavelet transform often uses an orthogonal basis of
wavelets which all have the same shape (but are shifted and scaled).
In general, speech processing is a more complex problem than data communications.
Data communications is the coding and decoding of signals (including speech) using
a known code, whereas speech processing includes the analysis of speech, where we
do not, in general, know the code that is being used.
Wavelets can be used to implement exact reconstruction filter banks. It is tempting to
think that one could adopt the methods (e.g. Sweldens' "predict" and "adjust" steps)
used to build wavelets, and use them in operating a complex bank or cascade of time
varying filters. We could still call this wavelet processing. It may be noteworthy in
this respect that Schwartz and Simoncelli (discussed earlier) have described a bank of
filters where the output of each filter is weighted by dividing by the weighted sum of
the rectified output of other filters in the bank. Since the filter bank could be implemented
as a wavelet transform, the use of the term wavelet transform would then evolve to
allow this use of time varying filters to still be called a wavelet transform. This could
mean that in the future, a cascade of filters implementing a speech synthesiser will
come to be called wavelet processing.
Sweldens developed a way to build wavelets ("predict" and "adjust" steps), but
predict and adjust steps are close at hand whenever we have feedback loops. So it might
be possible to develop a theory of a cascade of filters with feedback loops, using
precisely the methods developed by Sweldens.
A major concern among writers on wavelets since the late 1980s has been to require that a
wavelet transform use filters of the same shape. From the point of view of scientific
method, more concise theories are to be preferred, but conciseness does not imply
that all filters need to have the same shape. It may turn out that a concise method
which happens to generate filters of different shapes will give useful results in speech
analysis.
Evangelista and others have written on using wavelets to warp the time and frequency
scales. We can generalise the Fourier transform by using warping functions. Instead
of a kernel function exp(- 2 pi i w t) we could write exp(-2 pi i a(w) b(t)). Here a and
b are arbitrary warping functions. Oppenheim (1971) obtained time warping by
replacing the unit delays of a filter by first order allpass filters.
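A sketch of Oppenheim's device: a first-order allpass section has unit magnitude at every frequency but a nonlinear phase, so substituting it for a unit delay warps the frequency axis. This is a generic illustration, not Oppenheim's own code, and it needs the Signal Processing Toolbox freqz.

% First-order allpass filter H(z) = (z^-1 - a) / (1 - a z^-1) (sketch).
a   = 0.5;                            % warping parameter, |a| < 1
num = [-a 1];                         % numerator coefficients
den = [1 -a];                         % denominator coefficients
[H, w] = freqz(num, den, 512);        % magnitude of H is 1 everywhere
plot(w/pi, -unwrap(angle(H))/pi);     % warped frequency versus original frequency
xlabel('original normalised frequency');
ylabel('warped normalised frequency');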
If one took as a hypothesis that analysis of speech consists of identifying contrasting
formant tracks in actual data, then this means studying the evolution of frequency,
amplitude and bandwidth (F, A, Bw) for each formant track. In tracking this data, we
could build time varying filters. Ridsdill (1999) has written on using wavelets to
design time varying filters.
LITERATURE REVIEW.
E. Robinson implemented multi-channel Wiener filtering applied to geophysics in the
1950s and 1960s. He uses the term wavelet to mean a piece of waveform of zero
mean. Robinson was interviewed in 1997. Robinson points out that the word wavelet
goes back to Huygens in 1690.
G. Kaiser (1996), Physical Wavelets and Radar, IEEE Antennas and Propagation
Magazine, Feb 1996. Physical wavelets are acoustic waves resulting from the
emission of a time signal by a localized acoustic source moving along an arbitrary
trajectory in space. Thus they are localized solutions of the wave equation, and such
wavelets can be used as basis functions to construct general acoustic waves.
Physical wavelets are wavelets in two distinct senses: in the old sense pioneered by
Huygens, meaning localised acoustic waves, and in the modern sense pioneered by
Morlet, Daubechies etc., meaning functions that are all related by translation and
scaling and that form a basis for a vector space of functions.
A finite difference equation for the wave equation has recently been derived
mathematically from Huygens' wavelet model. This is equivalent to the transmission
line model, which uses a finite grid. One reason for looking at the wave equation in
this project is to try to make more of a connection between acoustics, as the term is
understood in physics or maths, and the acoustic analysis in the project topic.
Looking at a complex acoustics treatise (Morse, Acoustics), I could not find a
single line in the text that I could relate to wavelets in the sense of Haar and
Daubechies. But there will be many items in an acoustics text that are related to
wavelets; it is just that the connections may not be obvious. To take one example, D.
Berners and J. Smith from Stanford have written On the Use of Schroedinger's
Equation in the Analytic Determination of Horn Reflectance. The authors use a
change of variable to convert Webster's horn equation (a vocal tract model) to
Schroedinger's wave equation from quantum mechanics. The authors do not mention
wavelets, but other writers have shown that wavelets on an interval can be used to
successfully solve Schroedinger's equation. (This means, of course, that wavelets
could be used to treat Webster's horn equation. There may be more direct ways to
treat Webster's horn equation using wavelets than by solving Schroedinger's
equation; I do not know, and am only reporting here on connections that I have
found in the literature, to show that there are more connections between wavelets and
acoustics than might be apparent.)
G. Margrave (1997), Nonstationary filtering: review and update, CREWES
Research Report, Volume 9. p 2. : A powerful solution technique for any linear PDE,
known as Greens function theory, turns out to be a convolution or filtering process.
The impulsive solution is called a Greens function (or impulse response) and the
distributed source becomes the filter. An intuitive example of this Greens function
theory is the propagation of waves through the application of Huygens principle
(Figure 1a). This refers to the fact that a wavefront can be stepped forward in time
(i.e. extrapolated) by considering each point on the wavefront to be an impulsive
source and the new wavefront is synthesized from the superposition of all such
sources. Thus the wave is stepped forward by filtering the input wavefield with an
appropriate Greens function, which may be called a Huygens wavelet. (A complete
mathematical description of Huygens principle may be found in Morse and
Feshbach, 1953, p847).
Intuitively, one might make the assumption that speech is a convolutional code. A
convolutional coder steps along the input data, outputting code at each step. The
output and the new state are computed as functions of the input and state at each
step. Given a formant synthesiser consisting of a cascade of filters, we can describe
this cascade of filters by simultaneous equations, and hence we can, presumably,
develop a matrix model for the synthesiser. This is why I believe that a speech
synthesiser could be explicitly written up as a convolutional code.
SVD is a way of doing convolution; see K. Hermus (2004), Signal Subspace
Decompositions for Perceptual Speech and Audio Processing, p 2.

One interpretation of the topic is that it is about deriving concise mathematical
equations to specify stops.
Let us speculate that speech can be built from a bank of filters, each filter having a
time-varying amplitude, bandwidth, and frequency.
A wavelet transform is equivalent to a filter bank in which each wavelet filter is a
bandpass filter, with the channel frequencies spaced equally in log frequency and the
same Q value for each filter, i.e. the bandwidth of each channel filter is a constant
proportion of the nominal frequency of the channel.
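A small sketch of such a constant-Q channel layout; the frequency range, channels per octave and Q value are illustrative.

% Constant-Q channel layout, as in a wavelet filter bank (sketch).
fmin = 100; fmax = 8000;              % Hz, illustrative range
chans_per_octave = 4;
noct = log2(fmax / fmin);
fc = fmin * 2.^((0:floor(noct * chans_per_octave)) / chans_per_octave);
Q  = 8;                               % illustrative quality factor
bw = fc / Q;                          % bandwidth proportional to centre frequency
disp([fc(:) bw(:)]);                  % centre frequency and bandwidth per channel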
It is not clear, then, that we should be using a fixed filter bank, as a way of
implementing a speech model as time varying channels. It may be that we can get
better results, and a more concise model, by tracking each moving channel.
The possibility looms that extended Kalman filtering, being an orthogonal basis
method, is the correct approach to processing. This could encourage the definition of
wavelet transform to change to include orthogonal bases in general.
I have made the working assumption that acoustic analysis is analysis that is derived
from the waveform. Hence, it includes listening (or viewing) experiments by humans
or machines, that are based on the waveform. Below is a picture obtained by running
one of the many wavelet resources available as free Matlab code, from
paos.colorado.edu/research/wavelets.software.html. The picture is a wavelet
spectrogram. It shows "the price" from the utterance "the price range" in the Emu
database.

[Figure: wavelet spectrogram of "the price".]

A wavelet transform is equivalent to the operation of a bank of FIR filters.


M. Manko et al (2001), Tomograms and other transforms, J. Phys. A.: Math. Gen 34,
p 8321-8332. The wavelet transform is the nondiagonal matrix element of a unitary
irreducible representation of the affine group. They mention the Lie algebra of the
affine group. Kisil and others have given a group theoretic treatment of wavelets.
Fourier, Théorie analytique de la chaleur, www-fourier.ujf-grenoble.fr/chaleur.html,
says that mathematical analysis can at last give the laws of phenomena such as sound.
While no one doubts that Fourier was right, we are still waiting for a complete
explanation of the acoustics of speech.
STFT and wavelet transforms are maps from R1 to R1 x C1. If we consider maps in
the complex plane, then this takes us to the topic of analytic functions which have
been heavily studied for more than 100 years.
In the 1940s Gabor developed the windowed Fourier transform. He used a Gaussian
as the windowing function. The STFT (short term Fourier transform) is a sequence of
windowed Fourier transforms applied over a stepped window. This gives the same
result as a filter bank.
The wavelet transform is similar to STFT, but uses a window width that varies
inversely with the frequency of the filter bank channel. This means that higher
frequencies are resolved to a shorter time interval.
It is possible to recognise speech (given a large enough database of examples) by
directly matching waveform. However a multi-level approach may be better, where
we roughly match by using some statistics of the waveform, and do further matching
on smaller regions of the waveform. We can see wavelets as an attempt to work out
this idea of multi-level analysis of a signal.
C. Demars (2000), Représentations bidimensionnelles d'un signal de parole, gives
details of many time-frequency methods. Only a small proportion of these are called
wavelet methods. However, over the past 10 years some writers have suggested that
time-frequency processing in general will come to be called wavelet processing.
Wavelet processing is a vector space method. The wavelet transform consists of
decomposing a waveform as a sum of orthogonal basis functions.
Wavelets over finite fields are already used in data transmission. There are very few, if
any, reports of speech analysis (as distinct from coding for data transmission) using
wavelets over finite fields.
The wavelet transform gives a rotation in n-space. Since the wavelet transform is
specifiable as a matrix multiplication, it is tempting to suppose that one could use a
wavelet transform as a convolution code for speech. More generally, a convolution
coder for speech could be implemented as a composition of convolution codes.
Convolution codes can be error correcting; the only cases I know of where
convolution codes are error correcting are convolution codes over finite fields. This
suggests that we might attempt to look at wavelets over finite fields with a view to
their use in speech coding. (Wavelets over finite fields are already documented in the
literature of data communications coding methods.) It would be possible in this way
to use a convolution code to convert a 10-d representation of binary feature data to a
lower dimension, such as a 1-d waveform. We could lose detail projecting to a lower
dimension; this could show up as errors in the data. But a convolution code could
allow us to reconstruct the original data. There is perhaps no evidence at present that
this is in fact the way speech is coded. Even if it were, without knowing the exact
details of the convolution code used, it would seem impossible to decode. An
introduction to convolution codes would be given by buying the Communications
toolbox for Matlab. I have been unable to investigate this area for lack of time.
In data transmission the codes are known in advance. This is not the case for speech.
Nevertheless, it is tempting to hypothesise that speech will ultimately be found to be a
convolution code, in the sense that it will be able to be stated precisely as a coding
method like Morse code, in some high dimensional space.
Mathematical methods appear to be much wider than wavelet methods. However,
some writers have said that wavelets are equivalent to finite difference methods. One
writer said that the continuous wavelet transform is a form of the Cauchy contour
integral. This would mean that wavelets are of very wide scope because it would
seem to equate the theory of wavelets with the theory of functions of a complex
variable.
One writer says that morphological methods are more general than wavelet
processing. No doubt this view is correct. But nobody has a patent on the terms
"wavelet method" or "morphological method", and if it turns out to be convenient,
the meaning of the term "wavelet processing" will change because people will use it
in new ways.
Some writers have made connections between wavelets and Brownian motion. There is
code in the Matlab Wavelet Toolbox for fractional Brownian motion, but the code
does not work in my student version of Matlab.
EXPERIMENTS.
EXPERIMENT 1. Matching raw waveform.
MOTIVATION. If we are interested in matching waveforms using wavelets, then it is
useful to have some baseline results which do not use wavelets.
METHOD. A /b/ token was taken from a hand-labelled database (ARCTIC). This was
then compared to every utterance in the database. The method was to use the /b/
waveform as an FIR filter (i.e. as a single level wavelet transform). The maximum
value of the filter output for each sentence in the database was found. The hand label
corresponding to the position at which the maximum occurred, and the maximum
value itself, were recorded. The results were sorted to find the best matches.
RESULT. This method did not succeed in identifying /b/, but it appears that it could
be a successful way to identify the class of stops.


EXPERIMENT 2. Matching raw waveform.


METHOD. The above experiment was revised, so that the output was normalised by
dividing by the Euclidean length of the item in the current filter window.
RESULT. Heuristically, this method was very successful in recognising the phone
as /b/. However, results for other phones tested were not as good.
I did not test this method using other speakers.
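A hedged Matlab sketch of the method of Experiments 1 and 2: the /b/ token is slid over an utterance as an FIR filter, and for Experiment 2 the output is normalised by the energy of the waveform under the window. File names are hypothetical.

% Matched filtering with a /b/ token, raw and normalised (sketch).
[tok, fs]  = wavread('b_token.wav');       % hypothetical hand-cut /b/ token
[utt, fs2] = wavread('arctic_b0501.wav');  % utterance to search
h   = flipud(tok(:));                      % time-reverse so filtering = correlation
raw = filter(h, 1, utt);                   % unnormalised match (Experiment 1)
loc = sqrt(filter(ones(length(h),1), 1, utt.^2));  % local energy under the window
nrm = raw ./ (loc + eps);                  % normalised match (Experiment 2)
[best, pos] = max(nrm);                    % best match position in samples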
EXPERIMENT 3. Matching spectrum.
MATERIALS. I used DTW (dynamic time warping) code from Dan Ellis's web site;
he matches frames of STFT data.
RESULT. This was not successful in trying to match whole words. However, it had
more success in matching speech for a syllable like /ba/ against a speech database
from another speaker speaking a different language. Some of the experiments were
with very noisy speech.
EXPERIMENT 4. Kalman filtering.
Trying to model phonemes using Kalman filtering would appear to have parallels with
experiments using wavelets. The reason for this is that the wavelet transform is an
orthogonal basis method, and extended Kalman filtering tracks an orthogonal basis.
There is not a lot of literature on Kalman filtering applied to speech, but the two
approaches, of fitting pieces of speech to wavelet bases, and Kalman filtering, would
seem to be going in the same direction and could ultimately converge.
METHOD. An extended Kalman filter was used to model, for each phoneme, the
evolution of 11 MFCC coefficients. A sequence of phonemes was then resynthesised
by running the filter, switching to the model for each phone in turn.
RESULT. The result was not satisfactory: it was extremely unclear speech, quite
unintelligible unless the listener already knows what was said.
The above experiment could be repeated with other data; for example, instead of
using 11 MFCC coefficients, one could use MFCC coefficients and deltas, or raw
data, or wavelet data.
EXPERIMENT 5. RECOGNISING SPEECH USING SVM.
MOTIVATION. This is a baseline experiment, not using wavelets.
METHOD. The SVM was used to classify phoneme data from 1000 sentences in the
ARCTIC database. The trained SVM was then run to convert an utterance not in the
training set to a sequence of frames, each labelled with a phone. This method appeared
to be successful.
When tested on an untrained speaker, the method did not give an understandable
transcription.


EXPERIMENT 6. RECOGNISING SPEECH USING H2M (a HMM system).
MOTIVATION. A single word asr demonstration system may suffice for an example
of acoustic analysis, in the sense that it could be used to do phonemic minimal pair
discrimination tests.
MATERIALS. H2M is a HMM system by Daniel Cappe. It includes a demonstration
single word asr system using cepstral coefficients from FFT.
METHOD. Provide data for the demonstration. Run the demonstration.
RESULT. Reasonable results. (About 80% of the words were recognised correctly.)
EXPERIMENT 7. RECOGNISING SPEECH USING H2M (a HMM system) WITH
WAVELET DATA.
MOTIVATION. To test whether wavelet frame data could match the cepstral baseline
of Experiment 6.
MATERIALS. H2M is a HMM system by Daniel Cappe. It includes a demonstration
single word asr system using cepstral coefficients from FFT.
METHOD. Modify the code to use wavelet data (CWT) calculated from the SHATR
database. The coefficients used were the total energy in each channel. Run the
demonstration RMSPRECMODEL40. It worked, but did not give as good results as
the original program, which used cepstral data derived from FFT.
REVIEW. This experiment could be revised in an attempt to find better wavelet data.
It is also possible that I could have adapted the HMM to train better, by using much
more data. It is also possible that we could get better results by training several
HMMs, one for each feature.
EXPERIMENT 8. TEST A CONTINUOUS SPEECH HMM ASR SYSTEM.
METHOD. I was unable to work out how to train Becchetti's continuous speech asr.
Using the already trained system, it transcribed the sample data perfectly, using only a
phonetic model and not a word model.
RESULT. When tested on new speech, the results were occasionally intelligible, but
only for very short utterances. Even for one word utterances, the transcription was
generally very poor. (For comparison, even a panel of human experts would not be
able to transcribe data accurately from a language they had not heard before.)


EXPERIMENT 9. BUTTERWORTH FILTER BANK.


AIM. Implement a bank of low order Butterworth filters. Square the output (or take
absolute value) and smooth it to obtain a smoothed spectrogram.
This was done in 2002. I did not develop a good method to display spectrogram data
as a whole, so the snippet below shows average data abstracted from the smoothed
spectrogram.
This is a snippet from a document called installment02 which I wrote in 2002 but did
not submit. I have included this as a tiny record of the work I did in VB6 in 2002.

The plot shows two curves. One is the smoothed spectrum averaged over the first
25% of the hand-segmented phone. The other is the smoothed spectrum averaged over
the last 25% of the same hand-segmented phone. The plots are normalised using the
average power over the whole utterance. 1000 filter channels were used. The filters
were equal-Q, low order Butterworth filters. The data was then rectified or squared
and averaged over a moving window.
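The following Matlab sketch outlines the computation (it assumes the Signal Processing Toolbox). The channel count, Q and smoothing window are illustrative assumptions, not the 1000-channel settings used in the 2002 VB6 version.

fs = 16000;
x  = randn(1, fs);                               % placeholder for one second of speech
nChan = 24;                                      % far fewer channels than in 2002
fc = logspace(log10(100), log10(7000), nChan);   % log-spaced centre frequencies
Q  = 4;                                          % equal Q for every channel (assumed)
win = ones(1, 160) / 160;                        % 10 ms moving-average smoother
S = zeros(nChan, length(x));
for k = 1:nChan
    bw = fc(k) / Q;
    [b, a] = butter(2, [fc(k)-bw/2, fc(k)+bw/2] / (fs/2), 'bandpass');
    y = filter(b, a, x);                         % low order band-pass channel
    S(k, :) = filter(win, 1, y.^2);              % square and smooth
end
% S is the smoothed spectrogram: one row per channel, one column per sample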
EXPERIMENT 10. FORMANT SYNTHESIS.
Fortran code for M. Wagner's formant synthesiser was installed and run, from a
photocopy of the code.
I lacked the time required to adapt this code to do analysis by synthesis.



EXPERIMENT 11. PRINTOUTS OF SPECTROGRAMS.


I supplied an example of speech, analysed with both STFT and wavelets, in a report
sent in 2003. (In 2003, Dr Watson asked if I had obtained wavelet spectrograms and
compared these to STFT spectrograms).
COMMENT.
In the early 1980s M. O'Kane suggested automating spectrogram reading as a way of
doing asr. We could interpret the topic, using wavelets to acoustically analyse oral
stops, as encouraging the use of automated reading of spectrograms as an asr method.
We note that the output of wavelet transform processing is a wavelet spectrogram, so
doing asr by reading wavelet spectrograms would be a way of using wavelets to
analyse speech.
This could still be a very difficult problem, since image analysis in general remains
difficult. Without putting large resources into the analysis of images of speech, we
don't know what results could be obtained today. It seems obvious that in the future,
with much more computing power, image analysis will be a usable way to analyse speech.
EXPERIMENT 12. CLASSIFICATION OF PHONEMES FROM WAVELET DATA
USING R.
In the original statement for the project on the Macquarie website, it was suggested
that classification experiments could be done with wavelet data. In 2003 I submitted
the results of a classification experiment using STFT data. In 2004 I submitted results
for a similar experiment using wavelet data. This copied the format of an experiment
done in SLP806 Speech Recognition, in 2001. Wavelet processing is used to obtain
data for the frames used by the k-means vq classifier.
Arbitrary choices were made; it would be possible to make many (even millions of)
other choices of what data to include in the data frames. Specifically, the wavelet
transform was CWT, window width 256, no overlap, 4 octaves, 4 channels per octave.
Summary and scaled data was derived from this to obtain 34 data points per frame.
Arbitrarily, 16 of these points were discarded, leaving 18 points per frame. The signal
was not pre-boosted; it is not evident that a boost is required, because if the amplitude
of the power spectrum falls off as 1/f, then elementary calculus shows that this gives
equal power per octave, and hence equal power per filter channel. I dropped 10 points
at the left and right edges of the window; this was an arbitrary choice, but I thought it
might lessen edge effects which could introduce artifacts. Only stop data was used.
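The Matlab sketch below outlines the frame construction just described, ahead of the R classification code. It assumes the older Wavelet Toolbox call cwt(signal, scales, wavelet); the 34-point summary and the choice of which 18 points to retain are not reproduced, so this is an outline rather than the code actually used.

fs = 16000;
x  = randn(1, 4096);                       % placeholder stop-segment waveform
scales = 2 .^ (1 + (0:15)/4);              % 4 octaves, 4 channels per octave
win = 256; edge = 10;                      % window width 256, drop 10 edge points
nFrames = floor(length(x)/win);            % no overlap between windows
frames = zeros(nFrames, length(scales));
for i = 1:nFrames
    seg = x((i-1)*win+1 : i*win);
    c = cwt(seg, scales, 'morl');          % CWT coefficients, channels x samples
    c = c(:, 1+edge : end-edge);           % lessen edge effects
    frames(i, :) = sum(c.^2, 2)';          % a simple per-channel energy summary
end
% frames would then be summarised/scaled and passed to the k-means vq classifier in R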
y <- matrix(data=atraindata, nrow=1000, ncol=18)   # 1000 training frames, 18 points each
model <- train(y, trainlabs[1:1000])               # train the classifier on the labelled frames
autolabs <- classify(y, model)                     # classify the frames with the trained model
confusion(trainlabs, autolabs)                     # confusion matrix: hand labels vs automatic labels
      b    d    g    k    p    t
 b   80   26   16   13   16   10
 d   22  110   12   23   26   26
 g    3    1   33    1    0    0
 k   18   18    8   81   20   12
 p   14   19    6   18   81   10
 t   45   44    9   49   32   98


EXPERIMENT 13. ELECTRIC FIELD METAPHOR.


D. Ruta and B. Gabrys (2002 ?), Physical Field Models for Pattern Classification,
suggest an electric field metaphor for data which can be seen as charged particles
generating an electric field.
Before reading their work, I had done a related experiment. Two electrostatic
charges were placed in the plane and the electrostatic field was calculated at grid points.
Arbitrarily, the result was then interpreted as a spectrogram, and sound was
synthesised from this. Time did not allow further experiments.
Ruta and Gabrys claim that using diverse models can give a vast improvement in the
recognition results for classifiers.
EXPERIMENT 14. WAVELET SPECTROGRAM FROM M. JOHNSON'S TOOLKIT.
This is a very simple demonstration of using the wavelet spectrogram function from
M. Johnson's toolkit.
Data is resynthesised from MFCC coefficients and viewed using a wavelet
spectrogram. This is not claimed to be a meaningful sequence of operations.
STEP 1. Start matlab and load the mat file c:\classification\svm\SEP26svm10.mat.
(This contains data including MFCC data, created earlier).
STEP 2. rmtutor20 searches the database for the first occurrence of the handlabelling
corresponding to the fragment of speech /ih t w aa z/ 'it was'.
The instance is resynthesised from the MFCC data, and played back as sound.
In order to hear the item better, frames from before and after the item can also be
included.
STEP 3. Calculate the STFT spectrogram of 'it was'.
cd c:\policy\sm\smtoolbox\
Fs = 16000;
x = im(10000:16000);                        % segment of the waveform loaded in STEP 1
[tfr, tfr_s] = smt_stft(x,128,2048,1025);   % STFT from the Spectral Modelling Toolbox
smt_plotTFR(tfr,tfr_s)
axis([0 5000 0 5000])


STEP 4. Resynthesise from the STFT.


outSignal = smt_istft(tfr,tfr_s);
sound(outSignal,16000);
STEP 5. Calculate cwt
[cwt, cwt_s] = smt_cwt(x,256,16);
smt_plotTFR(cwt,cwt_s)
axis([0 500 0 5000])


STEP 6. Resynthesise from the cwt.


outSignalC = smt_icwt(cwt, cwt_s);
sound(outSignalC,20000);
EXPERIMENT 15. AD HOC METHOD TO DETECT /s/.
In the report in 2004, a high-pass filter H from the waveslim library was used to filter
speech. The speech was then half-wave rectified and averaged over a moving window
of width 200 with overlap 199. The high values appeared to match /s/.
This is an extremely ad hoc approach. It is preferable to have an automated approach
which will operate without the user needing to make ad hoc decisions on how to
recognise particular phones.
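A minimal Matlab sketch of the detector is given below (Signal Processing Toolbox assumed). A Butterworth high-pass stands in for the waveslim filter H, and the cutoff frequency and threshold are assumptions.

fs = 16000;
x  = randn(1, fs);                            % placeholder speech
[b, a] = butter(4, 4000/(fs/2), 'high');      % high-pass above ~4 kHz (assumed cutoff)
y  = filter(b, a, x);
y  = max(y, 0);                               % half-wave rectification
env = filter(ones(1,200)/200, 1, y);          % moving average, width 200, step 1
sCandidates = env > 3 * mean(env);            % high values suggest /s/ (ad hoc threshold)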
Niyogi and Sondhi (2001) discuss the design of a filter to detect stops. It uses ad hoc
input data (energy in two bands, and Wiener entropy, which is a computation on the
spectrum), so this is another example of an ad hoc experiment to recognise phonemes.
EXPERIMENT 16. METHOD TO DETECT STOPS.
In the report in 2004, speech was processed using CWT 4 octaves, 4 channels per
octave.


The hand labels were used to calculate a feature +STOP or -STOP for each frame. (I
used oral stops, but the Mannell and Clark document, following Halle and Stevens,
advocates treating oral stops plus nasal stops as a group. I have not redone the
experiment to include oral and nasal stops together.)
Bayes' theorem was used to calculate the probability that each vq item has the
feature value +STOP.
An utterance from the contin database was tested using the probabilities calculated
using the training data. For each frame, the probability that it was a stop was
estimated, given the vq coding of the test sentences.
Results: 4 out of 5 stops were detected. One stop was not detected, and there was a
false positive.
This is a very general naïve Bayesian method that can be tried for any features, using
any kinds of data.
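The following Matlab sketch shows the Bayes calculation in its simplest form. The variable names, codebook size and threshold are illustrative placeholders; only the estimate of P(+STOP | vq code) from counts corresponds to the experiment.

nCodes  = 64;                                 % codebook size (assumed)
vqTrain = randi(nCodes, 1, 5000);             % placeholder vq codes for training frames
isStop  = rand(1, 5000) < 0.1;                % placeholder +STOP labels from hand labelling
pStop = zeros(1, nCodes);
for c = 1:nCodes
    sel = (vqTrain == c);
    % P(+STOP | code c), with add-one smoothing to avoid zero counts
    pStop(c) = (sum(isStop(sel)) + 1) / (sum(sel) + 2);
end
vqTest = randi(nCodes, 1, 300);               % placeholder vq codes for a test utterance
pFrame = pStop(vqTest);                       % per-frame probability of +STOP
detected = pFrame > 0.5;                      % threshold (assumed)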
EXPERIMENT 17. USING NSLTOOLS.
C:\1may2005\2005oct22\nsl\nsltools\
This gives a wavelet analysis.
The Matlab code is from CAAR (Centre for Auditory and Acoustic Research). It
implements the 2-d wavelet transform.
A waveform is analysed as a 4-d matrix of wavelet data. The system displays a 3-d
slice of this data as a succession of 2-d frames, giving a movie.
This data may be very useful for asr but I have not worked out how to use it. Perhaps
just one of the 2-d slices could be selected automatically and used as data for a
recogniser. Earlier in this report I suggested that an experiment could be done in
passing all the data to an SVM classifier, as SVM is said to be able to handle large
numbers of coefficients successfully. This experiment was not done.
EXPERIMENT 18. FORMANT TRACKS.
Code from Dan Ellis's web site was used to synthesise sound from formant data. Even
a straight-line trajectory of the point (F1, F2, F3) gave a very complex sequence of
sounds.


EXPERIMENTS NOT DONE.


N1. CALCULATING STATISTICS. Fatemi-Ghomi suggests calculating local
statistics of wavelet coefficients. Examples of local statistics are local variance of
windows, local mean of squared values in windows, local mean of absolute values in
windows (Fatemi-Ghomi, p. 53). A possible approach to the project is to derive a
method for identifying segments, based on the probability that a particular segment
has a particular feature. The local statistics could be used as features.
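A minimal Matlab sketch of such local statistics, computed over a sliding window of one channel of wavelet coefficients (the window length and data are assumptions):

c = randn(1, 4096);                     % placeholder wavelet coefficient channel
w = 64;                                 % local window length (assumed)
k = ones(1, w) / w;
localMeanSq  = filter(k, 1, c.^2);               % local mean of squared values
localMeanAbs = filter(k, 1, abs(c));             % local mean of absolute values
localVar     = localMeanSq - filter(k, 1, c).^2; % local variance
% these three tracks could serve as features for segment classification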
N2. A FEATURE FOR LABIAL STOPS.
P. Lu (1999), Identifying Key Phoneme Features in Spectrograms, quotes Ali, who
computes a maximum normalised spectral slope which is low for labial stops. This is
the magnitude of the maximum value over the stop burst of the partial derivative of
the spectrogram intensity with respect to frequency, divided by the maximum of the
set of values created when the intensities are summed over all frequencies.
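Read literally, the definition can be sketched in Matlab as follows (the spectrogram matrix, FFT length and bin spacing are placeholders):

S  = abs(randn(129, 20));               % placeholder burst spectrogram, bins x frames
df = 16000 / 256;                       % bin spacing in Hz (assumed 256-point FFT at 16 kHz)
slope    = diff(S, 1, 1) / df;          % partial derivative of intensity w.r.t. frequency
maxSlope = max(abs(slope(:)));          % magnitude of the maximum slope over the burst
maxTotal = max(sum(S, 1));              % maximum of intensities summed over all frequencies
feature  = maxSlope / maxTotal;         % low values suggest a labial stop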
N3. SVD ANALYSIS OF DWT DATA.
Analyse speech using DWT. Use SVD to process a sequence of frames, under a
moving window. Obtain a time trajectory for a small number of eigenvectors.
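A minimal Matlab sketch of one variant, tracking the leading singular values of a sliding block of DWT frames (the data and block size are placeholders; tracking the leading singular vectors instead would be a small change):

D = randn(32, 200);                     % placeholder DWT coefficients, one column per frame
block = 10;                             % frames per moving window (assumed)
k = 3;                                  % number of singular values to track
traj = zeros(k, size(D,2) - block + 1);
for t = 1:size(D,2) - block + 1
    [U, S, V] = svd(D(:, t:t+block-1), 'econ');
    traj(:, t) = diag(S(1:k, 1:k));     % top k singular values of the current block
end
% traj is a low-dimensional time trajectory derived from the DWT data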
N4. USING WAVELETS TO MATCH PHONEMES.
C. J. Long and S. Datta suggest using different wavelet bases to recognise different
stop classes. For example, they use Local Cosine Transform to model /d/, and
Daubechies wavelet of order 4 to model /s/ and /z/. Within each type of wavelet, they
find the translation and dilation parameters which give the best match.
N5. GRID MODEL OF VOCAL TRACT.
A grid model with a time varying boundary could be used to model, e.g, the vocal
tract in sagittal plane, uttering a word or syllable. Wavelets can be used to solve grid
models, although currently there is no evidence they give better results than other
methods.
In Praat, P. Boersma has a demonstration articulatory synthesis system which the user can
experiment with by mouse clicks to define the time trajectory of articulatory
parameters. The result is computed and played as sound (e.g. /b a/). The results are
obtained by solving equations for the vocal system.
N6. MODEL OF AREA FUNCTION SEQUENCE.
Speech could be generated by defining a time sequence of area functions, which
describe the time-varying vocal tract as a sequence of 1-d functions. I know of no attempts to look
at this problem using wavelets. But a wavelet basis can be used to represent any
function as a sum of components.


REFERENCES.
C. Becchetti and L. Prina Ricotti (1999), Speech Recognition, Wiley.
D. Berners and J. Smith (1994), On the use of Schroedinger's Equation in the Analytic
Determination of Horn Reflectance, Proceedings of the 1994 International Computer
Music Conference, Aarhus, pp. 419-422.
M. Blomberg and K. Elenius (1981), Experiments with a segment-based speech
recognition system.
O. Cappe (2001), H2M Matlab functions for the estimation of mixtures and hidden
Markov models, www.tsi.enst.fr/~cappe/h2m
I. Christov (2004), Multiscale Image Edge Detection,
www.math.tamu.edu/~christov/downloads/wedgedet.pdf
C. Demars (2000), Representations bidimensionnelles dun signal de parole,
http://www.limsi.fr/Individu/chrd/
Dan Ellis (2005), http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/
Emu (1999). Supplied with J. Harrington and S. Cassidy, Techniques in Speech
Acoustics, Kluwer.
Fourier, Théorie analytique de la chaleur, www-fourier.ujf-grenoble.fr/chaleur.html
A. Gallardo-Antolin et al (2003), A Comparison of Several Approaches to the Feature
Extractor Design for ASR Tasks in Telephone Environment, 15th ICPhS Barcelona.
Malcolm Goris et al (1997), Reducing the Computational Load of a Kalman Filter,
IEEE Letters.
Allen Haar and R. Young, Enhanced Feature Extraction on Stop Consonants using
Wavelet Transform, www.icspat.com/papers/183mfi.pdf
K. Hermus (2004), Signal Subspace Decompositions for Perceptual Speech and Audio
Processing. p 2
X. Huang et al (2001), Spoken Language Processing, Prentice-Hall.
N. Kingsbury, http://cnx.rice.edu/content/m11138/latest/
S. Jaffard et al (2001), Wavelets: Tools for Science and Technology, SIAM, ISBN 0-89871-448-6.
M. Johnson (2002), The Spectral Modelling Toolbox,
http://eamusic.dartmouth.edu/~kimo/smt/spectralModelingToolbox.pdf
Junshui Ma and Y Zhao, OSU SVM Classifier Matlab Toolbox.


Kamran Mustafa and Ian Bruce (2005), Robust Formant Tracking For Continuous
Speech with Speaker Variability, preprint accepted for publication in IEEE
Transactions on Speech and Audio Processing.
R. Kirchner, Preliminary Thoughts on Phonologisation Within an Exemplar-Based
Speech Processing System, UCLA Working Papers in Linguistics, Vol 6.
C. J. Long and S. Datta, Wavelet based Feature Extraction for Phoneme Recognition,
www.asel.udel.edu/icslp/cdrom/vol1/239/a239.pdf

M. Manko et al (2001), Tomograms and other transforms, J. Phys. A.: Math. Gen 34,
p 8321-8332.
G. Margrave (1997), Nonstationary filtering: review and update, CREWES
Research Report, Volume 9. p 2.
D. Marr and E. Hildreth (1980), Theory of edge detection, Proc. Roy. Soc. Lond.,
B207:187-217. Cited by
http://iria.math.pku.edu.cn/~jiangm/courses/dip/html/node91.html
J. B. Millar (1976). Personal communication (Fortran code listings).
F. Murtagh, J. L. Starck and O. Renaud (2003), On Neuro-Wavelet Modeling.
Kamran Mustafa and Ian Bruce (2003), Robust Formant Tracking for Continuous
Speech with Speaker Variability,
grads.ece.mcmaster.ca/~mkamran/Research/ISSPA2003.pdf
P. Niyogi and M. Sondhi (2001), Detecting Stop Consonants in Continuous Speech.
P. Niyogi and C. Burges (2004), Detecting and Interpreting Acoustic Features by
Support Vector Machines.
Jeff O'Neill (1999), download the zip file from Matlab File Exchange:
http://www.mathworks.com/matlabcentral/fileexchange/download.do?objectId=63&fn=DiscreteTFDs&fe=.zip&cid=802830

A. Reza (1999), Wavelet Characteristics, 19 Oct 1999 White Paper.


T. Ridsdill (1999), Wavelet Design of Time-Varying Filters.
E. Robinson (1997),
http://www.ieee.org/organizations/history_center/oral_histories/transcripts/robinson_e.html

F. Rossi and N. Villa (2005), Classification in Hilbert Spaces with Support Vector
Machines.
Evan Ruzanski (2003), Perfect reconstruction filter banks using fifth order
Butterworth filters. cslr.colorado.edu/~ruzanski/ECE258BProjectReport.pdf


O. Schwartz and E. Simoncelli (2001), Natural sound statistics and divisive
normalisation in the auditory system, in T. K. Leen et al (ed) (2001), Advances in
Neural Information Processing Systems vol 13, MIT Press.
W. Seto (1971), Acoustics, Schaum's Series, McGraw-Hill.
W. Sweldens (1998), The Lifting Scheme: A Construction of Second Generation
Wavelets, http://citeseer.ist.psu.edu/sweldens98lifting.html
Junshui Ma and Yi Zhao (2002), SVM code in Matlab. Their web site seems to have
disappeared, but inactive links to it can be found in Google.
Beng Tan, M. Fu, A. Spray, and P. Dermody (1996), The use of wavelet transforms
in phoneme recognition. ICSLP Conference on Spoken Language Processing,
Philadelphia.
B. Vidakovic, www.isye.gatech.edu/~brani/datasoft/WavMat.m.
M. Wagner (1978), Experimental Software Speech Synthesizer, Technical Report No
2, Computing Research Group, ANU.


APPENDICES.
APPENDIX 1. COMMENTS ON THE PROJECT.
In 2002, the Macquarie website had information on the topic of this project.
"Although short in duration compared to other speech sounds, stops are not
associated with invariant acoustic cues." An experiment by Wood in 1970 shows
that if a stop in an utterance is replaced by noise, the listener hears the word that
makes sense in context, and is not able to hear that the stop has been replaced
by noise.
In listening to short pieces of speech, e.g. 0.1 s, snipped from labelled speech, one can
sometimes hear other phones besides those given by the hand labelling. This is
not surprising. For example, the identity of a stop can depend on the following vowel,
so we may need to play a longer piece in order to hear the labelled result.
This, however, does not show that there are no strong acoustic cues for stops. It could
be, for example, that there are 8 possible cues for a given stop, and the stop is
heard if at least 4 of these cues are present and no other stop is indicated.
Kirchner says that a word or a phone could be identified by its nearness to other examples
of the same word or phone; unlike the standard theory, no single property need be
present in every member of the cohort. He says we are not attempting to find
invariant phonetic properties.
It may be that no particular feature needs to be present, because we have a redundant
coding. The only invariant property we need is the fact that our system
recognises phonemes by comparing speech to exemplars using a distance metric.
This leaves open the possibility of discovering more specific invariant phonetic
properties.
"The acoustic features of silence, burst, and aspiration are variably evident in
acoustic analysis. Spectral analysis using the FFT may smear out some
interesting acoustic events, depending on the sampling frequency and frame of
the FFT. Another form of spectral analysis which may provide a better spectral
representation is wavelet analysis."
Becchetti (1999), p. 129, says: "70ms windows have better frequency resolution, but
fast transitions in the spectrum (as for instance in the case of stop consonants) are not
detected."
The STFT implies an arbitrary choice of window width and windowing function. D.
Thomson devised a multiple-window method to give more accurate spectral analysis;
his function is in the Matlab Signal Processing Toolbox as pmtm. (The example does not
work due to bugs in my student version of Matlab.)
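For reference, a minimal call looks like the following (Signal Processing Toolbox; the time-bandwidth product and FFT length are illustrative):

fs = 16000;
x  = randn(1, 2048);                 % placeholder frame of speech
nw = 4;                              % time-bandwidth product (assumed)
[pxx, f] = pmtm(x, nw, 2048, fs);    % Thomson multitaper power spectral density
plot(f, 10*log10(pxx)); xlabel('Hz'); ylabel('dB');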
It is possible that wavelet analysis will reveal information not shown by using
sinusoids. Jaffard (2001), p. 7: "The study of nonstationary signals necessitates
techniques different from Fourier analysis. These techniques, which are specific to the
non-stationary character of the signal, include wavelets of the time-frequency type."
Pole-zero filter theory has been used to design wavelets, and it can also be used to
design time-varying filters.
"This project will involve implementing a wavelet algorithm as a function either
in C++ or R."
There is a lot of code on the net for implementing wavelets. Because of computational
and development costs, many wavelet functions are not yet available as code. For
example, Baraniuk has described wavelets of six and more dimensions which could be
used to match speech.
"Then using this function investigate whether wavelet analysis on stops would
provide any more useful information than the traditional spectral analysis."
The details of wavelet spectrograms and STFT spectrograms are different. However,
given a single spectrogram, whether wavelet or STFT, there are many ways of
viewing the data, as it can be thresholded and smoothed.
So the limiting factor for us today may not be the availability of wavelet tools but
the availability of data-viewing tools in general. It may be that new things can be
extracted from STFT spectrograms. In fact, anything in the speech is in the
spectrogram in a coded form, and could be extracted with the right decoder.
Classification experiments could be performed to contrast the two methods.
A classification experiment was performed for the SLP813 project in 2003, using
functions in R.
H2M, a HMM system written in Matlab with a demonstration asr program for isolated
words, was obtained from the net and used for the SLP813 project. The Matlab code
was adapted to process wavelet data. This did not give improved results over the
non-wavelet method. Of course, this does not prove that wavelets are not useful, only
that an experiment has not been devised that shows their usefulness.
For the project, an SVM written in Matlab was obtained from the net and used to
classify labelled speech. The trained SVM was then able to give a good decoding (one
phoneme symbol per frame) at the frame level. However, the trained system gave very
poor results in trying to decode utterances by other speakers.
It was expected that results would be obtainable by training separate SVMs to
recognise features. However, results have so far been unsuccessful. If the above
concept had worked, there would have been a baseline system against which to
experiment with wavelet data.


APPENDIX 2. Demonstration of asr using SVM.


Aim. Test whether the system successfully decodes an utterance not used in training.
(The speaker was used in training).
1. Run c:\realproject\rmarctic05.m on 1000 sentences
2. Run c:\classification\svm\RMTESTSVMCOEF1TO10.m
Save workspace as C:\classification\svm\SEP26svm10.mat
3. Run c:\realproject\RMARCTIC105.m to process the test waveform
arctic_b0501.wav
4. Run c:\classification\SVM\RMTESTSVMONE.m
Here is a snippet from the utterance 'I said':
/ih ay ay ay ae ae ay eh eh ey iy s s s s s s s s s
eh eh eh eh eh eh eh eh ih d/
It should be possible to convert the above output to normal phonetic spelling. I would
have liked to use a HMM to do this but have not done so. A simple rule which prints
a phoneme symbol only when the input contains N or more consecutive examples of
the phoneme, except in the case of a stop, works reasonably for N = 2 to 4 on the
above example. For N = 4 this rule gives the result /ay s eh d/, which we can accept
as a phonetic spelling of 'I said'. (A minimal sketch of this rule, without the stop
exception, follows.)
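This Matlab sketch applies the rule to the frame labels above. It is an illustration only: the stop exception needed to keep short stop runs (such as the final /d/) is not implemented.

labs = {'ih','ay','ay','ay','ae','ae','ay','eh','eh','ey','iy', ...
        's','s','s','s','s','s','s','s','s', ...
        'eh','eh','eh','eh','eh','eh','eh','eh','ih','d'};
N = 4;                                   % minimum run length
out = {};
i = 1;
while i <= numel(labs)
    run = i;
    while run < numel(labs) && strcmp(labs{run+1}, labs{i})
        run = run + 1;                   % extend the run of identical labels
    end
    if run - i + 1 >= N
        out{end+1} = labs{i};            % keep one symbol per sufficiently long run
    end
    i = run + 1;
end
% out holds one symbol per run of N or more identical frame labels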
Above, the decoding was done at the frame level. For a complete asr system we
would need to derive either an ordinary phonetic spelling or an ordinary English
spelling (if we are only handling speech in English).
APPENDIX 3. MATCHING SPEECH.
APPENDIX 3.1 MATCHING RAW WAVEFORM.
It is possible to compare raw waveform. A simple experiment conducted with 1000
sentences from ARCTIC suggests that an instance of /b/ could be reliably identified
by looking at the labels for the closest matches.
Taking an example of a /b/ phoneme segment from the ARCTIC database, this
segment was used as an FIR filter to filter the utterances in the database. The best
match in each utterance was obtained as the maximum value of the filtered output.
The value of the phone label for the best match was recorded; this gives one phone
label per utterance. These values were then sorted by the figure of merit value to find
the best matches.
For /b/, the best 4 matches were /t/, /p/, /b/, /p/. This suggests that this method does
not identify the segment accurately, but it may well identify the item as being in the
class of stops.
The above operation is also called (single level) wavelet analysis.
The experiment was repeated, normalising the output at each point by dividing by the
Euclidean length of the item in the window (i.e. the square root of the dot product of
the item with itself). This gave a more interesting result, in that the top 4 matches were
all /b/ and 18 out of the first 25 matches were /b/.
Results for other stops were not as good as for /b/. Due to lack of time resources,
complete results have not been obtained.
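A minimal Matlab sketch of the normalised matching is given below. The template and utterance are placeholders; reversing the template makes the FIR filtering a correlation with the segment.

template = randn(1, 400);                        % placeholder /b/ segment
x        = randn(1, 16000);                      % placeholder utterance
h = fliplr(template);                            % reversed segment used as an FIR filter
score  = filter(h, 1, x);                        % raw matched-filter output
energy = sqrt(filter(ones(1, numel(template)), 1, x.^2));  % Euclidean length in each window
normScore = score ./ max(energy, eps);           % normalised figure of merit
[best, idx] = max(normScore);                    % position of the best match in this utterance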
APPENDIX 3.2 MATCHING MFCC DATA.
An example of a /d/ segment was correctly identified by matching against a library of
10,000 frames for the same speaker. The distance metric used was a dot product using
10 MFCC coefficients, energy, and deltas and double deltas.
I don't have complete results due to lack of time.
APPENDIX 4. What is a wavelet?
The geophysicist E. Robinson used the term wavelet in the 1960s and 1970s to
mean a finite piece of waveform of zero mean and finite energy. So Robinson's
definition includes all FIR filters of zero mean.
In 1910, Haar defined a function based on the square wave. Haar's function, and also
the function sin(x)/x, later known as the sinc function, introduced by Whittaker in 1915,
both obey a multiscale equation. In the 1980s Daubechies found other functions
which obey a multiscale equation. All these functions are called wavelets. A wavelet
transform decomposes a waveform into a weighted sum of orthogonal wavelets,
where each wavelet is a scaled, shifted version of a single mother wavelet. However, the
wavelet packet transform allows (many) different kinds of wavelets to be used in the
one transform.
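As the simplest concrete example, a single level of the Haar transform takes scaled pairwise averages and differences and can be inverted exactly. A minimal Matlab sketch:

x = randn(1, 8);                            % any even-length signal
a = (x(1:2:end) + x(2:2:end)) / sqrt(2);    % approximation (low-pass) coefficients
d = (x(1:2:end) - x(2:2:end)) / sqrt(2);    % detail (high-pass) coefficients
xr = zeros(size(x));                        % perfect reconstruction from a and d
xr(1:2:end) = (a + d) / sqrt(2);
xr(2:2:end) = (a - d) / sqrt(2);
% max(abs(x - xr)) is at rounding-error level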
In the 1990s Sweldens used a method based on predict and adjust steps to calculate
wavelets.
Today it is not clear that a wavelet is required to obey a multiscaling equation. The
Alpert wavelet does not obey a multiscaling equation; instead, the orthogonal basis
functions are derived using the Gram-Schmidt orthogonalisation process. In 1996
Sweldens said he was unable to define a limit for where wavelets were heading, other
than FIR filtering generally. This makes the concept of wavelet somewhat
problematic. Perhaps we should be talking about FIR and IIR filtering generally,
rather than wavelets. I note that some very experienced writers on DSP seem to avoid
using the word wavelet in their work.
Sweldens has developed a method of building wavelets by using predict and adjust
steps. Predict and correct steps are very common in digital processing methods such
as DPCM and Kalman filtering. This makes me wonder if wavelet processing can be
equated with a multi-level digital filtering method. Sweldens's work seems to place
wavelets within z-transform theory.
One possible interpretation of wavelets is that they make results from complex variable
theory more accessible: one writer says that the wavelet transform is equivalent to the
Cauchy contour integral formula. Wavelets are a way to implement filter banks, and a
way to implement orthogonal bases. But many other methods exist which are not
wavelet methods.
Unser and Blu say that wavelets can be constructed which calculate exact (discrete
versions of) derivatives. A couple of writers say that wavelets are equivalent to finite
difference methods. This could suggest that wavelets are a way of packaging existing
ideas. From the point of view of scientific method, it is difficult to see why there is any
particular merit in building an orthogonal basis from a single wavelet that is shifted
and scaled; the only criterion of interest is conciseness. (Some) wavelets do have a
concise formula, but so do Butterworth filters.
But more and more complex variations of wavelets are being developed. Alpert
wavelets do not use a scaling equation. It is not clear that Sweldens's wavelets obey
the scaling equation; they are designed using a different method (factorisation of
Laurent polynomials).
Recently, an Empirical Mode Decomposition has been devised by Huang. This has
some resemblance to wavelet processing, in that it is an iterative method for
approximating a waveform. The steps are: (i) identify, locally in time, the fastest
oscillation; (ii) subtract it; (iii) iterate on the residual.
Perhaps any cascade of filters could be called a wavelet transform. R. de Queiroz,
Lapped Transforms: "The relation between filter banks and discrete wavelets is well
known. Under conditions that are easily satisfied, an infinite cascade of filter banks
will generate a set of continuous wavelet bases."
F. Murtagh et al (2003), p. 2, under "Wavelets and Feature Discovery": "A naïve approach
would consist of using a bank of filters with varying frequencies and widths.
Unfortunately choosing the proper filters is a difficult task. The wavelet transform
provides a sound mathematical principle for designing and spacing filters." I don't
buy their argument; in fact I think their argument supports saying that any filters
are wavelets.
APPENDIX 5. What is acoustic analysis?
We might summarise our requirements for acoustic analysis of speech as: how do we
process waveforms (or spectrograms) to obtain features? From Chomsky and Halle
(1968), one might assume that matrix methods will suffice. It seems reasonable to
assume that matrix methods have sufficient computational power. Some people are
using large matrices, of size 10,000 * 10,000 or more, to do ICA on speech; such matrices
are too big for my computer, although they will be easily manageable in 5 or 10 years'
time.
APPENDIX 6. Examples of CWT spectrogram.
In Matlab, colour pictures of wavelet spectrograms can be obtained by using the
Spectral Modelling Toolbox; M. Johnson (2002), Master's thesis, Dartmouth.


This is an example of cwt.


The above picture does not show all the detail, because the colour presentation is
quantised. As an experiment in image processing, a picture was obtained by
morphological operations on one of the colour planes in the above picture. The
Image Processing Toolbox in Matlab was used.
C:\18june2005\


This is points 1-4800 of the utterance 'the price range' from Emu.
The above were made by versions of
C:\policy\sm\smtoolbox\rmim05
The data from the cwt is complex. If we convert it to real (or imaginary), we can still
reconstruct a waveform that sounds more or less the same.
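A minimal sketch of that real-part reconstruction, reusing the smt_* calls from Experiment 14 (Spectral Modelling Toolbox assumed; the waveform here is a placeholder for the Emu segment):

x = randn(1, 4800);                      % placeholder for points 1-4800 of the utterance
[cwtData, cwt_s] = smt_cwt(x, 256, 16);
cwtReal = real(cwtData);                 % discard the imaginary part
smt_plotTFR(cwtReal, cwt_s)              % plot the real-valued version
outReal = smt_icwt(cwtReal, cwt_s);      % resynthesise from the real data only
sound(outReal, 16000);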


This is a version of the above plot, where the cwt data has been converted to real.


The picture below shows the spectrogram between 4 kHz and 8 kHz, for points 1701
to 2800 in the waveform. This showed as red in the first picture.

What we are looking at, near 23 ms on the time scale, is the start of the /p/ release.

--- the end --

