A NEW ALGORITHM FOR VOICE SIGNAL COMPRESSION
SUITABLE FOR LIMITED STORAGE DEVICES

Submitted by

RAGAVENDRAN.G          21407106056
RAJESH KUMAR.G         21407106060
SATHYA NARAYANAN.P     21407106070
SHANTANU CHAKRABORTY   21407106076

BACHELOR OF ENGINEERING
IN
ELECTRONICS AND COMMUNICATION ENGINEERING

SAKTHI MARIAMMAN ENGINEERING COLLEGE,
THANDALAM, CHENNAI-602105
BONAFIDE CERTIFICATE

Certified that this project report "A NEW ALGORITHM FOR VOICE SIGNAL COMPRESSION SUITABLE FOR LIMITED STORAGE DEVICES" is the bonafide work of RAGAVENDRAN.G, RAJESH KUMAR.G, SATHYA NARAYANAN.P and SHANTANU CHAKRABORTY, who carried out the project work under my supervision.

SIGNATURE
Dr.H.RANGANATHAN
HEAD OF THE DEPARTMENT
ECE DEPARTMENT,
SAKTHI MARIAMMAN ENGINEERING COLLEGE,
THANDALAM, CHENNAI-602105

SIGNATURE
M.SAROJINI
SUPERVISOR
ECE DEPARTMENT,
SAKTHI MARIAMMAN ENGINEERING COLLEGE,
THANDALAM, CHENNAI-602105

INTERNAL EXAMINER                    EXTERNAL EXAMINER
ACKNOWLEDGEMENT

With the blessing of the Lord Almighty and of our revered preceptors, we have accomplished this project.

"Life is short; don't waste time worrying about what people think of you. Hold on to the ones that care; in the end they will be the only ones there." We deem it a privilege to thank our chairman Dr.K.N.Ramchandran and our principal Dr.K.Vijaya Baskara Raju for giving us an opportunity to do this project work.

"It makes your heart feel good. It is just good for your soul." Words are inadequate to express our heartfelt thanks to our Head of the Department, Dr.H.Ranganathan, who never restricted our ideas. We thank our supervisor M.SAROJINI, who guided us all the way and gave us the idea of doing this project, for her wholehearted cooperation in completing this project successfully.

"A word of encouragement during a failure is worth more than an hour of praise after success." We will be ever thankful to our department's teaching and non-teaching staff, without whom we may not have reached this position.

"A mind at peace, a mind centered and not focused on harming others, is stronger than any physical force in the universe." We pay our high regards to our parents and friends for their constant support. It is they who gave us their full support, both economically and mentally, to complete this project.
TABLE OF CONTENTS

CHAPTER NO.   TITLE                                              PAGE NO.

              ABSTRACT                                           iii
              LIST OF TABLE                                      iv
              LIST OF FIGURES
1.            INTRODUCTION                                       1
              1.1    Auditory properties
                     1.1.1  Non-linear frequency response of the ear
                     1.1.2  Masking property of the auditory system
              1.2    Audio compression
                     1.2.1  Lossless compression
                     1.2.2  Lossy compression
              1.3    Speech compression
              1.4    Evaluating compressed audio
2.            IMPLEMENTATION OF VSC                              10
4.            WAVELET                                            11
              4.1    Properties of Wavelet                       12
5.
              5.1    ENCODING PROCESS                            13
                     5.1.1  SAMPLING                             13
                     5.1.2  WAVELET DECOMPOSITION                14
                     5.1.4  WAVELET COMPRESSION                  15
                     5.1.5  PSYCHOACOUSTIC MODEL                 15
                     5.1.6  QUANTIZATION                         19
                     5.1.7  BIT ALLOCATION                       21
              5.2    DECODING PROCESS                            22
                     5.2.1  WAVELET RECONSTRUCTION               22
              11.1   CONCLUSION                                  30
ABSTRACT

In this project, Voice Signal Compression (VSC) is a technique used to convert the voice signal into an encoded form; when decompression is required, the signal can be decoded to the closest approximation of the original. This work presents a new algorithm to compress voice signals by using a Wavelet Decomposition and a Psychoacoustic Model.

The main goals of this paper are:

i) Transparent compression of high-quality voice signals.

ii) To evaluate the compressed voice signal against the original voice signal with the help of distortion analysis and the frequency spectrum.

iii) To reduce the maximum noise from the compressed file and calculate the SNR (Signal-to-Noise Ratio).

To implement this voice signal compression, a wavelet decomposition is used according to psychoacoustic model criteria and the computational complexity of the decoder. A bit allocation method is used that also takes its input from the psychoacoustic model. The filter bank structure yields a quality-of-performance measure in the form of a subband perceptual rate, computed as the perceptual entropy (PE). The decoder can then obtain the best reconstruction possible considering the size of the output produced at the encoder. The result is a variable-rate compression scheme for high-quality voice signals.
LIST OF TABLE

Table                                                            Pg.No
Table 1                                                          30
LIST OF FIGURES

TITLE                                                            Pg. no
Figure 1.1   An example that shows how the auditory properties
             can be used to compress a digital audio signal      3
Figure 4.1   Representation of wavelet in 2-dimensional and
             3-dimensional                                       12
Figure 10.1  Result analysis for voice signal compression
             using Samples 1, 2, 3, 4 and 5                      28
Figure 10.2  Graph: Memory size vs SNR and Memory size
             vs Entropy                                          28
1.INTRODUCTION
The aim of this project is to introduce the concept of voice signal compression (VSC). This introduction covers some aspects of psychoacoustics and presents a brief summary of current audio compression techniques.
1.1 Auditory properties
1.1.1 Non-linear frequency response of the ear
Humans are able to hear frequencies in the range approximately from 20 Hz to
20 kHz. However, this does not mean that all frequencies are heard in the same way.
One could make the assumption that a human would hear frequencies that make up
speech better than others, and that is in fact a good guess. Furthermore, one could also
hypothesize that hearing a tone becomes more difficult close to the extreme frequencies (i.e. close to 20 Hz and 20 kHz).
We have found that the frequency range from 20 Hz to 20 kHz can be broken
up into critical bandwidths, which are non-uniform, non-linear, and dependent on the
level of the incoming sound. Signals within one critical bandwidth are hard to separate
for a human observer. This behavior is described in detail by the Bark scale and the Fletcher curves.
1.1.2 Masking property of the auditory system
Auditory masking is a perceptual property of the human auditory system that
occurs whenever the presence of a strong audio signal makes a temporal or spectral neighborhood of a weaker audio signal imperceptible. This means that the masking effect can be observed in both the time and the frequency domains. Normally the two are studied separately, and they are known as simultaneous masking and temporal masking.
If two sounds occur simultaneously and one is masked by the other, this is
referred to as simultaneous masking. A sound close in frequency to a louder sound is
more easily masked than if it is far apart in frequency. For this reason, simultaneous
masking is also sometimes called frequency masking. It is important to differentiate
between tone and noise maskers, because tonality of a sound also determines its
ability to mask other sounds. A sinusoidal masker, for example, requires a higher intensity to mask a noise-like signal than a loud noise-like masker needs to mask a sinusoid.
Similarly, a weak sound emitted soon after the end of a louder sound is masked
by the louder sound. In fact, even a weak sound just before a louder sound can be
masked by the louder sound. These two effects are called forward and backward
temporal masking, respectively. Temporal masking effectiveness attenuates exponentially from the onset and offset of the masker, with the onset attenuation lasting approximately 10 ms and the offset attenuation lasting approximately 50 ms.
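As a rough illustration (an assumed functional form for exposition, not a model given in this report), such an exponential decay of the masking threshold after the masker ends can be written as

$$ M(t) = M_0 \, e^{-t/\tau}, $$

where $M_0$ is the masking level at the masker boundary and $\tau$ is a time constant chosen so that the attenuation dies out over roughly 10 ms before the onset and 50 ms after the offset.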
It is of special interest for perceptual audio coding to have a precise description
of all masking phenomena to compute a masking threshold that can be used to
compress a digital signal. Using this, it is possible to reduce the required SNR and therefore the
number of bits. A complete masking threshold should be calculated using the
principles of simultaneous masking and temporal masking and the frequency response
of the ear. In the perceptual audio coding schemes, these masking models are often
called psychoacoustic models.
Figure 1.1 - An example that shows how the auditory properties can be used to compress a digital audio signal.
1.2 Audio compression
The idea of audio compression is to encode audio data to take up less storage
space and less bandwidth for transmission. To meet this goal different methods for
compression have been designed. Just like every other digital data compression, it is
possible to classify them into two categories: lossless compression and lossy
compression.
implementation issues. It is interesting for our report to make some brief comments on these audio coders, because some of the features of the wavelet-based audio coders are based on those models.
(a) MP1 (MPEG audio layer-1): Simplest coder/decoder. It identifies local tonal
components based on local peaks of the audio spectrum.
(b) MP2 (MPEG audio layer-2): It has an intermediate complexity. It uses data from
the previous two windows to predict, via linear interpolation, the component of
the current window. This is based on the fact that tonal components, being more
predictable, have higher tonality indices.
(c) MP3 (MPEG audio layer-3): Higher level of complexity. It not only includes masking in the time domain but also a more elaborate psychoacoustic model, MDCT decomposition, dynamic allocation and Huffman coding.
All three layers of MPEG-1 use a polyphase filterbank for signal decomposition into 32 equal-width subbands. This is a computationally simple solution and provides reasonable time-frequency resolution. However, it is known that this approach has three notable deficiencies:

Equal subbands do not reflect the critical bands of noise masking, so the quantization error cannot be tuned properly.

Those filter banks and their inverses do not yield perfect reconstruction, introducing error even in the absence of quantization error.

Adjacent filter banks overlap, so a single tone can affect two filter banks.
These problems have been fixed by a new format which is considered the
successor of the MP3 format: AAC (Advanced Audio Coding), defined in MPEG-4 Part 3 (with the extension .m4a, also known as MP4 audio).
(d) M4A: AAC (MPEG-4 Audio): Similar to MP3, but it increases the number of subbands up to 48 and fixes some issues in the previous perceptual model. It has higher coding efficiency for stationary and transient signals, providing a better and more stable quality than MP3 at equivalent or slightly lower bitrates.
1.3 Speech compression
Speech signals have unique properties that differ from those of general audio/music signals. First, speech is a signal that is more structured and band-limited around 4 kHz.
These two facts can be exploited through different models and approaches and at the
end, make it easier to compress. Many speech compression techniques have been
efficiently applied. Today, applications of speech compression (and coding) involve
real time processing in mobile satellite communications, cellular telephony, internet
telephony, audio for videophones or video teleconferencing systems, among others.
Other applications also include storage and synthesis systems used, for example, in
voice mail systems, voice memo wristwatches, voice logging recorders and interactive
PC software.
Basically, speech coders can be classified into two categories: waveform coders and analysis-by-synthesis vocoders. The former were explained before and are not widely used for speech compression, because they do not provide considerably low bit rates; they are mostly aimed at broadband audio signals. On the other hand, vocoders use an entirely different approach to speech coding, known as parametric coding, or analysis-by-synthesis coding, where no attempt is made at reproducing the exact speech waveform at the receiver, but rather at creating a signal that is perceptually equivalent to it. These systems provide much lower data rates by using a functional model of the human speaking mechanism at the receiver. Among those, perhaps the most popular technique is the Linear Predictive Coding (LPC) vocoder. Some higher quality vocoders include RELP (Residual Excited Linear Prediction) and CELP (Code Excited Linear Prediction). There are also lower quality vocoders that give very low bit rates, such as the Mixed Excitation vocoder, the Harmonic coding vocoder and Waveform interpolation coders.
1.4 Evaluating compressed audio
When evaluating the quality of compressed audio it is also convenient to
differentiate between speech signals and general audio/music signals. Even though
speech signals have more detailed methods to evaluate the quality of a compressed
signal (like intelligibility tests), both audio/music and speech share one of the most
common methods: acceptability tests. These tests are the most general way to evaluate
the quality of an audio/speech signal, and they are mainly determined by asking users
their preferences for different utterances. Among those tests, Mean Opinion Score
(MOS) test is the most used one. It is a subjective measurement that is derived entirely
by people listening to the signals and scoring the results from 1 to 5, with a 5 meaning
that speech quality is perfect or transparent. The test procedure requires carefully
prepared and controlled test conditions. The term transparent quality means that
most of the test samples are indistinguishable from the original for most of the
listeners. The term was defined by the European Broadcasting Union (EBU) in 1991
and statistically implemented in formal listening tests since then.
Finally, it is necessary to emphasize that audio quality has no objective measure that can be extracted directly from the signal (such as the mean squared error), which makes it more difficult to evaluate. This is because subjective evaluations require a large number of test samples and special conditions during the evaluation.
2.IMPLEMENTATION OF VSC
Voice signal compression is a technique used to convert a voice signal into an encoded form; when decompression is required, it is decoded to the closest approximation of the original signal. The compression technique that we have used is a lossy compression, because it reduces perceptual redundancy. In real time, when we record our voice in the .wav file format, it takes a large amount of memory to store. In order to avoid this, we have designed a new algorithm for compressing the voice signal to approximately 50% of its original size. The main goal of this algorithm is to compress high-quality audio while maintaining transparent quality.
For this purpose, the voice signal is decomposed according to the psychoacoustic model and the computational complexity of the decoder. A bit allocation method is used for this, which takes its input from the psychoacoustic model. The idea of voice signal compression is to encode voice data to take up less storage space and less transmission bandwidth. This work is well suited to high-quality voice signal transfer for Internet and storage applications.
The software that we have used is MATLAB 7.6, because MATLAB provides the Wavelet Toolbox, with the help of which we compute and process the voice signal.
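As a minimal sketch of this pipeline (assuming a mono .wav input, the db4 wavelet, 5 decomposition levels and a simple fixed hard threshold standing in for the full psychoacoustic bit allocation; the file name and threshold are illustrative, not the project's actual parameters):

% Minimal sketch of the VSC pipeline in MATLAB 7.6 (Wavelet Toolbox).
[x, fs] = wavread('sample1.wav');       % read the recorded voice signal
x = x(:).';                             % work with a row vector
[c, l] = wavedec(x, 5, 'db4');          % wavelet decomposition (encoder)
thr = 0.05 * max(abs(c));               % illustrative threshold; the real
c   = wthresh(c, 'h', thr);             % coder uses psychoacoustic criteria
xr  = waverec(c, l, 'db4');             % wavelet reconstruction (decoder)
snrDb = 10*log10(sum(x.^2) / sum((x - xr).^2));  % quality check in dB

In a threshold-based sketch like this, the zeroed coefficients are what make the stream compressible: long runs of zeros can be coded compactly, which is where the memory saving comes from.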
4.WAVELET
Wavelets are mathematical functions that cut up data into different frequency components (decomposition) and then study each component with a resolution
matched to its scale. A wavelet is a waveform of effectively limited duration that has
an average value of zero. A wavelet is a wave-like oscillation with an amplitude that
starts out at zero, increases, and then decreases back to zero.
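In standard wavelet notation (not specific to this report), the zero-average property reads

$$ \int_{-\infty}^{\infty} \psi(t)\, dt = 0, $$

and "effectively limited duration" means that the energy of the wavelet $\psi$ is concentrated in a finite interval.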
There are many wavelet families. They include:
Haar
Daubechies
Biorthogonal
Coiflets
Symlets
Morlet
Mexican Hat
Meyer
Of all these, the Daubechies wavelet is suitable for voice signal compression. This wavelet type has balanced frequency responses but non-linear phase responses. Daubechies wavelets use overlapping windows, so the high-frequency coefficient spectrum reflects all high-frequency changes. Therefore, Daubechies wavelets are useful in compression and noise removal in audio signal processing, whereas the other wavelets are less suitable for VSC, because they introduce a shift in the Fourier analysis of the frequency spectrum.
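To see what this wavelet looks like, the Wavelet Toolbox can render its scaling and wavelet functions; a small illustrative sketch (not part of the project code):

% Plot the db4 scaling function (phi) and wavelet function (psi).
[phi, psi, xval] = wavefun('db4', 8);            % 8 refinement iterations
subplot(2,1,1); plot(xval, phi); title('db4 scaling function phi');
subplot(2,1,2); plot(xval, psi); title('db4 wavelet function psi');

The plot makes the text's description concrete: psi oscillates, averages to zero, and dies out over a short support.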
format is pulse code modulation (PCM), which is used by the compact disc. The quality of CD-audio signals is referred to as the standard for high fidelity. CD-audio signals are sampled at 44.1 kHz and quantized using 16 bits per sample PCM, which results in a very high bit rate: 44,100 samples/s × 16 bits/sample = 705,600 bits/s, or about 705.6 kbps per channel.
that particular frequency components should be discarded, since the listener will be unable to hear those frequencies of the signal anyway. Auditory masking is a perceptual property of the human auditory system that occurs whenever the presence of a strong audio signal makes a temporal or spectral neighborhood of a weaker audio signal imperceptible.
If two sounds occur simultaneously and one is masked by the other, this is
referred to as simultaneous masking. Similarly, a weak sound emitted soon after the
end of a louder sound is masked by the louder sound. In fact, even a weak sound just
before a louder sound can be masked by the louder sound. These two effects are called
forward and backward temporal masking, respectively.
The psychoacoustic model distinguishes three elements:

I. Tone Maskers
II. Noise Maskers
III. Masking Effects
Tone Maskers:

Determining whether a frequency component is a tone (masker) requires knowing whether it has been held constant for a period of time, as well as whether it is a sharp peak in the frequency spectrum, which indicates that it is above the ambient noise of the signal.
A frequency f (with FFT index k) is a tone if its power spectrum P[k] is:

Greater than P[k+1] and P[k-1], i.e. a local peak of the spectrum; and

7 dB greater than the other frequencies in its neighborhood, where the neighborhood depends on f:

If 0.17 kHz < f < 5.5 kHz, the neighborhood is [k-2, k+2]
If 5.5 kHz ≤ f < 11 kHz, the neighborhood is [k-3, k+3]
If 11 kHz ≤ f < 20 kHz, the neighborhood is [k-6, k+6]
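A small sketch of this test in MATLAB (assumptions, not from the report: 44.1 kHz sampling, a 512-point FFT and a synthetic frame containing a 1 kHz tone):

% Flag tonal maskers in an FFT power spectrum, per the criteria above.
N = 512; fs = 44100;
t = (0:N-1)/fs;
frame = sin(2*pi*1000*t) + 0.01*randn(1, N);   % synthetic test frame
P = 10*log10(abs(fft(frame)).^2 + eps);        % power spectrum in dB
isTone = false(1, N/2);
for k = 3 : N/2 - 1
    f = (k-1)*fs/N;                            % bin center frequency in Hz
    if f <= 170 || f >= 20000, continue; end
    if f < 5500
        d = 2;                                 % neighborhood [k-2, k+2]
    elseif f < 11000
        d = 3;                                 % neighborhood [k-3, k+3]
    else
        d = 6;                                 % neighborhood [k-6, k+6]
    end
    if k - d < 2 || k + d > N/2, continue; end % stay inside the spectrum
    nb = [k-d : k-2, k+2 : k+d];               % neighbors beyond k-1 and k+1
    localPeak = P(k) > P(k-1) && P(k) > P(k+1);
    if localPeak && all(P(k) >= P(nb) + 7)     % 7 dB dominance criterion
        isTone(k) = true;
    end
end

Bins flagged in isTone are treated as tone maskers; everything else falls through to the noise-masker case described next.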
Noise Maskers:
If a signal is not a tone, it must be noise. Thus, one can take all frequency
components that are not part of a tone neighborhood and treat them like noise.
Masking Effects:

The maskers which have been determined affect not only the frequencies within a critical band, but also those in the surrounding bands. It is found that the frequency range from 20 Hz to 20 kHz can be broken up into critical bandwidths, which are non-uniform, non-linear, and dependent on the level of the incoming sound. Signals within one critical bandwidth are hard to separate for a human observer. The center frequency location of these critical sub-bands is known as the critical band rate and approximately follows the expression below.
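A widely used approximation for the critical band rate is Zwicker's formula (quoted here in its standard form):

$$ z(f) = 13 \arctan(0.00076\, f) + 3.5 \arctan\!\left( \left( \frac{f}{7500} \right)^{2} \right) \ \text{Bark}, $$

with f in Hz.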
The distance from one critical band center to the center of the next band is 1
Bark. The human auditory frequency range spreads from 20 Hz to 20 kHz and covers approximately 25 Barks. [Figure: general frequency response of the human auditory system.]
difference between the continuous signal and the quantized signal is the quantization error. Mu-law is a logarithmic compressor.
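A small sketch of mu-law companding (assuming the standard value mu = 255; this is the usual formula, not one quoted from this report):

% Mu-law compression and its inverse (expansion), mu = 255.
mu = 255;
x  = linspace(-1, 1, 1001);                         % normalized input
y  = sign(x) .* log(1 + mu*abs(x)) ./ log(1 + mu);  % logarithmic compressor
xr = sign(y) .* ((1 + mu).^abs(y) - 1) ./ mu;       % expander (inverse)
max(abs(x - xr))                                    % ~0: round trip is exact

Small amplitudes get many more quantization levels than large ones, which matches the ear's roughly logarithmic loudness perception.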
Data vectors: These are the input data which have to be vector quantized.
temporal coefficients that exist in the wavelet packet coefficients. This bit allocation takes its input from the psychoacoustic model and the wavelet compression scheme.
[Block diagram: encoder - psychoacoustic model, bit allocation, header insertion, stream of data; decoder - header extraction, wavelet reconstruction, output stored in the .wav warehouse; stop.]
[Plots: original and compressed audio signal waveforms, amplitude vs time in seconds, for SAMPLE 1, SAMPLE 2, SAMPLE 3, SAMPLE 4 and SAMPLE 5.]
Figure 10.1 Result analysis for voice signal compression using Samples 1, 2, 3, 4 and 5.

Figure 10.2 Graph: Memory size vs SNR and Memory size vs Entropy.
TABLE 1

Sample     Memory before    Level   SNR     Entropy   Distortion   Memory after
           compression              (dB)    (bits)    analysis     compression
Sample 1   10 MB            8       49.19   6.5774    0.0481       5.04 MB
Sample 2   3.56 MB          6       45.81   6.8191    0.0241       1.78 MB
Sample 3   734 KB           5       36.03   5.3281    0.0050       367 KB
Sample 4   500 KB           7       34.56   8.6960    0.0029       250 KB
Sample 5   294 KB           5       42.13   8.3211    0.0098       147 KB
DISCUSSION

In this VSC, we took five samples, as listed in the table above. We noticed that the memory size of the voice signal is compressed by approximately 50%. For example, in Sample 4 the memory size of 500 KB is compressed to 250 KB, and the Signal-to-Noise Ratio, entropy and distortion analysis are computed between the compressed voice signal and the original voice signal. Even when the decomposition level was changed, the memory size was still compressed by approximately 50%. The Signal-to-Noise Ratio and the entropy are measured in dB and bits respectively.
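A sketch of how these three measures can be computed in MATLAB, assuming x is the original signal and xr the decoded one, both already in memory and of equal length. The mean-squared-error distortion and the 256-bin histogram entropy estimate are assumptions for illustration; the report's entropy figures come from the perceptual entropy computation:

% Quality measures between the original x and the reconstruction xr.
err   = x - xr;
snrDb = 10*log10(sum(x.^2) / sum(err.^2));        % Signal-to-Noise Ratio, dB
dist  = mean(err.^2);                             % distortion (assumed: MSE)
p = hist(xr, 256); p = p/sum(p); p(p == 0) = [];  % 256-bin amplitude pmf
H = -sum(p .* log2(p));                           % entropy in bits per sample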
11.1 CONCLUSION

On the basis of the Daubechies wavelet decomposition combined with the psychoacoustic model, the proposed algorithm compresses high-quality voice signals to approximately 50% of their original size while maintaining transparent quality, as indicated by the SNR, entropy and distortion analysis of the five test samples.