
A NEW ALGORITHM FOR VOICE SIGNAL COMPRESSION
SUITABLE FOR LIMITED STORAGE DEVICES

A PROJECT REPORT

Submitted by

RAGAVENDRAN. G          21407106056
RAJESH KUMAR. G         21407106060
SATHYA NARAYANAN. P     21407106070
SHANTANU CHAKRABORTY    21407106076

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING
IN
ELECTRONICS AND COMMUNICATION

SAKTHI MARIAMMAN ENGINEERING COLLEGE, THANDALAM

ANNA UNIVERSITY: CHENNAI 600 025

APRIL 2011


ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report "A NEW ALGORITHM FOR VOICE SIGNAL
COMPRESSION SUITABLE FOR LIMITED STORAGE DEVICES" is the bonafide
work of RAGAVENDRAN. G, RAJESH KUMAR. G, SATHYA NARAYANAN. P, and
SHANTANU CHAKRABORTY, who carried out the project work under my
supervision.

SIGNATURE                               SIGNATURE
Dr. H. RANGANATHAN                      M. SAROJINI
HEAD OF THE DEPARTMENT,                 SUPERVISOR,
ECE DEPARTMENT,                         AP, ECE DEPARTMENT,
SAKTHI MARIAMMAN ENGINEERING            SAKTHI MARIAMMAN ENGINEERING
COLLEGE, THANDALAM,                     COLLEGE, THANDALAM,
CHENNAI-602105                          CHENNAI-602105

SUBMITTED TO VIVA VOCE HELD ON.

INTERNAL EXAMINER                       EXTERNAL EXAMINER

ACKNOWLEDGEMENT

With the blessings of the Lord Almighty and of our revered preceptors, we have
accomplished this project.

"Life is short; don't waste time worrying about what people think of you. Hold on
to the ones that care; in the end they will be the only ones there." We deem it a
privilege to thank our Chairman, Dr. K. N. Ramchandran, and our Principal,
Dr. K. Vijaya Baskara Raju, for giving us the opportunity to do this project work.

Words are inadequate to express our heartfelt thanks to our Head of the
Department, Dr. H. Ranganathan, who never restricted our ideas. We thank our
supervisor, M. Sarojini, who guided us all the way, gave us the idea of doing this
project, and cooperated wholeheartedly in completing it successfully.

"A word of encouragement during a failure is worth more than an hour of praise
after success." We will be ever thankful to the teaching and non-teaching staff of
our department, without whom we may not have reached this position.

"A mind at peace, a mind centered and not focused on harming others, is stronger
than any physical force in the universe." We pay our high regards to our parents
and friends for their constant support; it is they who supported us, both
economically and mentally, to complete this project.

TABLE OF CONTENTS

CHAPTER NO.  TITLE                                               PAGE NO.

             ABSTRACT                                            iii
             LIST OF TABLES                                      iv
             LIST OF FIGURES
1.           INTRODUCTION
             1.1    Auditory properties
                    1.1.1  Non-linear frequency response of the ear
                    1.1.2  Masking property of the auditory system
             1.2    Audio compression
                    1.2.1  Lossless compression
                    1.2.2  Lossy compression
                    1.2.3  MPEG audio coding standards
             1.3    Speech compression
             1.4    Evaluating compressed audio
2.           IMPLEMENTATION
3.           EXISTING (PROBLEM IDENTIFICATION)                   10
4.           WAVELET                                             11
             4.1    Properties of wavelet                        12
5.           VOICE SIGNAL COMPRESSION                            13
             5.1    ENCODING PROCESS                             13
                    5.1.1  SAMPLING                              13
                    5.1.2  WAVELET DECOMPOSITION                 14
                    5.1.3  FAST FOURIER TRANSFORM (FFT)          15
                    5.1.4  WAVELET COMPRESSION                   15
                    5.1.5  PSYCHOACOUSTIC MODEL                  15
                           5.1.5.1  Absolute threshold of hearing  15
                           5.1.5.2  Auditory masking             17
                    5.1.6  QUANTIZATION                          19
                    5.1.7  BIT ALLOCATION                        21
             5.2    DECODING PROCESS                             22
                    5.2.1  WAVELET RECONSTRUCTION                22
                    5.2.2  Writing to .wav file format           23
6.           ALGORITHM FOR VSC                                   24
7.           FLOW CHART FOR VSC                                  25
8.           RESULT AND DISCUSSION                               26
9.           CONCLUSION AND FUTURE WORK                          30
             9.1    CONCLUSION                                   30
             9.2    FUTURE WORK                                  30
             REFERENCES                                          31

ABSTRACT

In this project, Voice Signal Compression (VSC) is a technique used to convert
a voice signal into an encoded form from which, when the signal is required again,
the closest approximation of the original can be decoded. This work presents a new
algorithm to compress voice signals using wavelet decomposition and a
psychoacoustic model.

The main goals of this work are:
i) transparent compression of high-quality voice signals;
ii) evaluation of the compressed voice signal against the original by means of
distortion analysis and the frequency spectrum;
iii) reduction of the noise in the compressed file and calculation of the SNR
(signal-to-noise ratio).

To implement this voice signal compression, a wavelet decomposition is chosen
according to the psychoacoustic model criteria and the computational complexity of
the decoder. A bit allocation method that also takes its input from the
psychoacoustic model is used. The filter bank structure yields a measure of
performance in the form of a subband perceptual rate, computed as the perceptual
entropy (PE). The decoder obtains the best reconstruction possible given the size of
the output produced at the encoder. The result is a variable-rate compression scheme
for high-quality voice signals.

iii

LIST OF TABLES

Table                                        Pg. No.

Table 1. Result analysis for voice signal    30

iv

LIST OF FIGURES

TITLE                                                                 Pg. No.

Figure 1.1   An example that shows how the auditory properties
             can be used to compress a digital audio signal            3
Figure 4.1   Representation of a wavelet in two and three dimensions  12
Figure 5.1   Block diagram of the encoder                             14
Figure 5.2   Block diagram of the decoder                             14
Figure 5.3   Wavelet decomposition                                    16
Figure 5.4   Simultaneous masking                                     18
Figure 5.5   Temporal masking                                         18
Figure 5.6   Critical bandwidth                                       20
Figure 5.7   Quantizing the signal                                    22
Figure 5.8   Reconstruction of the signal                             24
Figure 10.1  Result analysis for voice signal compression using
             samples 1, 2, 3, 4 and 5                                 28
Figure 10.2  Graphs of memory size vs. SNR and memory size
             vs. entropy                                              28

1.INTRODUCTION
The aim of this project is to introduce the concept of voice signal
compression (VSC). This introduction covers some aspects of psychoacoustics and
presents a brief summary of current audio compression techniques.
1.1 Auditory properties
1.1.1 Non-linear frequency response of the ear
Humans are able to hear frequencies in the range approximately from 20 Hz to
20 kHz. However, this does not mean that all frequencies are heard in the same way.
One could make the assumption that a human would hear frequencies that make up
speech better than others, and that is in fact a good guess. Furthermore, one could also
hypothesize that hearing a tone becomes more difficult close to the extreme
frequencies (i.e. close to 20 Hz and 20 kHz).
We have found that the frequency range from 20 Hz to 20 kHz can be broken
up into critical bandwidths, which are non-uniform, non-linear, and dependent on the
level of the incoming sound. Signals within one critical bandwidth are hard to separate
for a human observer. A detailed description of this behavior is described in the Bark
scale and Fletcher curves.
1.1.2 Masking property of the auditory system
Auditory masking is a perceptual property of the human auditory system that
occurs whenever the presence of a strong audio signal makes a temporal or spectral
neighborhood of a weaker audio signal imperceptible. This means that the masking
effect can be observed in both the time and frequency domains. Normally the two
effects are studied separately and are known as simultaneous masking and temporal
masking.

If two sounds occur simultaneously and one is masked by the other, this is
referred to as simultaneous masking. A sound close in frequency to a louder sound is
more easily masked than if it is far apart in frequency. For this reason, simultaneous
masking is also sometimes called frequency masking. It is important to differentiate
between tone and noise maskers, because tonality of a sound also determines its
ability to mask other sounds. A sinusoidal masker, for example, requires a higher
intensity to mask a noise-like masker than a loud noise-like masker does to mask a
sinusoid.
Similarly, a weak sound emitted soon after the end of a louder sound is masked
by the louder sound. In fact, even a weak sound just before a louder sound can be
masked by the louder sound. These two effects are called forward and backward
temporal masking, respectively. Temporal masking effectiveness attenuates
exponentially from the onset and offset of the masker, with the onset attenuation
lasting approximately 10 ms and the offset attenuation lasting approximately 50 ms.
It is of special interest for perceptual audio coding to have a precise description
of all masking phenomena to compute a masking threshold that can be used to
compress a digital signal. Using this, it is possible to reduce the SNR and therefore the
number of bits. A complete masking threshold should be calculated using the
principles of simultaneous masking and temporal masking and the frequency response
of the ear. In the perceptual audio coding schemes, these masking models are often
called psychoacoustic models.

Figure 1.1 - An example that shows how the auditory properties can be used to
compress a digital audio signal.
1.2 Audio compression
The idea of audio compression is to encode audio data to take up less storage
space and less bandwidth for transmission. To meet this goal, different methods for
compression have been designed. As with every other kind of digital data
compression, it is possible to classify them into two categories: lossless
compression and lossy compression.

1.2.1 Lossless compression


Lossless compression in audio is usually performed by waveform coding
techniques. These coders attempt to copy the actual shape of the analog signal,
quantizing each sample using different types of quantization. They approximate the
waveform and, if a large enough bit rate is available, get arbitrarily close to it. A
popular waveform coding technique that is considered an uncompressed audio format
is pulse code modulation (PCM), which is used by the Compact Disc Digital Audio
standard (or simply CD). The quality of CD audio signals is referred to as the
standard for high fidelity. CD audio signals are sampled at 44.1 kHz and quantized
using 16 bits/sample PCM, resulting in a very high bit rate of 705 kbps.

As mentioned before, human perception of sound is affected by the SNR,
because adding noise to a signal is less noticeable when the signal energy is large
enough. When digitizing an audio signal, the SNR should ideally be constant for all
quantization levels, which requires a step size proportional to the signal value. This
kind of quantization can be done using a logarithmic compander
(compressor-expander). With this technique it is possible to reduce the dynamic
range of the signal, thus increasing the coding efficiency by using fewer bits. The
two most common standards are the μ-law and the A-law, widely used in telephony.

Other lossless techniques have been used to compress audio signals, mainly by
finding and removing redundancy or by optimizing the quantization process. Among
those techniques are adaptive PCM and differential quantization. Other lossless
techniques such as Huffman coding and LZW have been applied directly to audio
compression without obtaining significant compression ratios.

1.2.2 Lossy compression


Lossy compression reduces perceptual redundancy; i.e. sounds that are
considered perceptually irrelevant are coded with decreased accuracy or not coded at
all. For this it is better to have frequency-domain coders, because the perceptual
effects of masking can be implemented more easily in the frequency domain by
using subband coding. Using the properties of the auditory system, we can eliminate
frequencies that cannot be perceived by the human ear: frequencies that are too low
or too high are eliminated, as well as soft sounds that are drowned out by loud
sounds. To determine what information in an audio signal is perceptually irrelevant,
most lossy compression algorithms use transforms such as the Modified Discrete
Cosine Transform (MDCT) to convert time-domain sampled waveforms into the
frequency domain. Once the signal is transformed into the frequency domain, bits
can be allocated to frequency components according to how audible they are (i.e.
the number of bits can be determined by the SNR).

The audibility of spectral components is determined by first calculating a
masking threshold, below which it is estimated that sounds will be beyond the limits
of human perception. Briefly, the MDCT is a Fourier-related transform with the
additional property of being lapped: it is designed to be performed on consecutive
blocks of a larger data set, where subsequent blocks are overlapped so that the last
half of one block coincides with the first half of the next block. This overlapping,
in addition to the energy-compaction qualities of the DCT, makes the MDCT
especially attractive for signal compression applications, since it helps to avoid
artifacts stemming from the block boundaries.
1.2.3 MPEG Audio coding standards
Moving Pictures Experts Group (MPEG) is an ISO/IEC group charged with the
development of video and audio encoding standards. MPEG audio standards include
an elaborate description of perceptual coding, psychoacoustic modeling and

implementation issues. It is relevant for this report to comment briefly on these
audio coders, because some of the features of wavelet-based audio coders are based
on those models.
(a) MP1 (MPEG audio layer-1): Simplest coder/decoder. It identifies local tonal
components based on local peaks of the audio spectrum.
(b) MP2 (MPEG audio layer-2): It has an intermediate complexity. It uses data from
the previous two windows to predict, via linear interpolation, the component of
the current window. This is based on the fact that tonal components, being more
predictable, have higher tonality indices.
(c) MP3 (MPEG audio layer-3): Highest level of complexity. It not only includes
masking in the time domain but also a more elaborate psychoacoustic model,
MDCT decomposition, dynamic bit allocation and Huffman coding.
All three layers of MPEG-1 use a polyphase filter bank for signal decomposition
into 32 equal-width subbands. This is a computationally simple solution and provides
reasonable time-frequency resolution. However, this approach has three notable
deficiencies:
(i) Equal subbands do not reflect the critical bands of noise masking, so the
quantization error cannot be tuned properly.
(ii) The filter banks and their inverses do not yield perfect reconstruction,
introducing error even in the absence of quantization error.
(iii) Adjacent filter banks overlap, so a single tone can affect two subbands.
These problems have been addressed by a new format that is considered the
successor of the MP3 format: AAC (Advanced Audio Coding), defined in MPEG-4
Part 3 (with the extension .m4a, also called MP4 audio).
(d) M4A: AAC (MPEG-4 Audio): Similar to MP3, but it increases the number of
subbands up to 48 and fixes some issues in the previous perceptual model. It has
higher coding efficiency for stationary and transient signals, providing better
and more stable quality than MP3 at equivalent or slightly lower bitrates.
1.3 Speech compression
Speech signals have unique properties that differ from those of general
audio/music signals. First, speech is more structured and band-limited around
4 kHz. These two facts can be exploited through different models and approaches
and, in the end, make speech easier to compress. Many speech compression
techniques have been applied efficiently. Today, applications of speech compression
(and coding) involve real-time processing in mobile satellite communications,
cellular telephony, internet telephony, and audio for videophones or
video-teleconferencing systems, among others. Other applications include storage
and synthesis systems used, for example, in voice mail systems, voice memo
wristwatches, voice logging recorders, and interactive PC software.

Basically, speech coders can be classified into two categories: waveform
coders and analysis-by-synthesis vocoders. The first category was explained before;
these coders are not widely used for speech compression, because they do not
provide considerably low bit rates and are mostly aimed at broadband audio signals.
Vocoders, on the other hand, use an entirely different approach to speech coding,
known as parametric coding or analysis-by-synthesis coding, in which no attempt is
made to reproduce the exact speech waveform at the receiver, but rather a signal
perceptually equivalent to it. These systems provide much lower data rates by using
a functional model of the human speaking mechanism at the receiver. Among those,
perhaps the most popular technique is the Linear Predictive Coding (LPC) vocoder.
Higher-quality vocoders include RELP (Residual Excited Linear Prediction) and
CELP (Code Excited Linear Prediction). There are also lower-quality vocoders that
give very low bit rates, such as mixed-excitation vocoders, harmonic coding
vocoders, and waveform interpolation coders.
1.4 Evaluating compressed audio
When evaluating the quality of compressed audio it is also convenient to
differentiate between speech signals and general audio/music signals. Even though
speech signals have more detailed methods to evaluate the quality of a compressed
signal (like intelligibility tests), both audio/music and speech share one of the most
common methods: acceptability tests. These tests are the most general way to evaluate
the quality of an audio/speech signal, and they are mainly determined by asking users
their preferences for different utterances. Among those tests, Mean Opinion Score
(MOS) test is the most used one. It is a subjective measurement that is derived entirely
by people listening to the signals and scoring the results from 1 to 5, with a 5 meaning
that speech quality is perfect or transparent. The test procedure requires carefully
prepared and controlled test conditions. The term transparent quality means that
most of the test samples are indistinguishable from the original for most of the
listeners. The term was defined by the European Broadcasting Union (EBU) in 1991
and statistically implemented in formal listening tests since then.
Finally, it must be emphasized that the quality of an audio signal has no
objective measure that can be extracted directly from the signal (such as the mean
square error), which makes it more difficult to evaluate. This is because subjective
evaluations require a large number of test samples and special conditions during the
evaluation.

2.IMPLEMENTATION OF VSC
Voice signal compression is a technique that converts a voice signal into an
encoded form; when the original is required, it is decoded to the closest
approximation of the original signal. The compression technique used here is lossy,
because it reduces perceptual redundancy. In practice, when we record our voice in
the .wav file format, it takes a large amount of memory to store. To avoid this, we
have designed a new algorithm that compresses the voice signal to approximately
50% of its original size. The main goal of this algorithm is to compress high-quality
audio while maintaining transparent quality.

For this purpose, the voice signal is decomposed according to the
psychoacoustic model and the computational complexity of the decoder. A bit
allocation method that takes its input from the psychoacoustic model is used. The
idea of voice signal compression is to encode voice data so that it takes up less
storage space and less transmission bandwidth. This work is well suited to
high-quality voice signal transfer over the Internet and to storage applications.

The software used is MATLAB 7.6, since MATLAB provides the Wavelet
Toolbox, with the help of which the voice signal is processed.

3.EXISTING (PROBLEM IDENTIFICATION)


A lot of work has been done in the field of data compression and signal
analysis, generating many results over the past few decades. When a given file is
compressed, its extension is changed according to the algorithm used, and the format
family of the original data changes with it: for example, when a .wav file is
compressed to MP3 using the MPEG algorithm.

The problem identified in this work is that compressing a file in one format
automatically changes it to another file format. To rectify this, the Wavelet Toolbox
in MATLAB is used instead of the discrete cosine transform (DCT).


4.WAVELET
Wavelets are mathematical functions that cut up data into different frequency
components (decomposition) and then study each component with a resolution
matched to its scale. A wavelet is a waveform of effectively limited duration that has
an average value of zero: a wave-like oscillation with an amplitude that starts out at
zero, increases, and then decreases back to zero.
There are many wavelet families, among them:
Haar

Daubechies

Biorthogonal

Coiflets

Symlets

Morlet

Mexican Hat

Meyer

Of all these, the Daubechies wavelet is suitable for voice signal compression.
This wavelet type has balanced frequency responses but non-linear phase responses.
Daubechies wavelets use overlapping windows, so the high-frequency coefficient
spectrum reflects all high-frequency changes. Therefore, Daubechies wavelets are
useful in compression and noise removal in audio signal processing, whereas the
other wavelets are less suitable for VSC, because they introduce a shift in the
Fourier analysis of the frequency spectrum.


Figure 4.1. Representation of a wavelet in two and three dimensions


4.1 Properties of wavelet
Wavelets are mathematical functions with the following properties:
They integrate to zero, so they oscillate.
They have unit energy.
They have compact support. This stems from the fact that a shorter support in
time is required for good time resolution.
An additional property is that the DC component is zero. This comes from the
admissibility condition, which is necessary to develop the inverse wavelet
transform. These conditions help us visualize a wavelet as a small wave.


5.VOICE SIGNAL COMPRESSION


This voice signal compression consists of an encoding process and a decoding
process. A brief description of each follows.

5.1 ENCODING PROCESS


BLOCK DIAGRAM

Figure 5.1. Block diagram of the encoder


The above diagram shows the block diagram of the encoding process. The
explanation of each block is given below.
5.1.1 SAMPLING
The first step in encoding is sampling, which converts the continuous-time
signal into a discrete-time signal (i.e. analog-to-digital conversion). A popular
waveform coding technique that is an uncompressed audio format is pulse code
modulation (PCM), which is used by the Compact Disc. The quality of CD audio
signals is referred to as the standard for high fidelity. CD audio signals are sampled
at 44.1 kHz and quantized using 16 bits/sample PCM, resulting in a very high bit
rate of 705 kbps.

5.1.2 WAVELET DECOMPOSITION


The next step is decomposing the recorded voice into frames; the frame size of
each decomposed signal is 2048 bytes. The signal is segmented into frames, and the
decomposition is performed by filtering and downsampling. Downsampling reduces
the sampling rate of a signal by a factor of 2.
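The report performs this step with MATLAB's Wavelet Toolbox and a Daubechies wavelet. Purely as an illustration (not the report's code), one analysis level of a discrete wavelet transform, i.e. filtering followed by downsampling by 2, can be sketched in Python using the Haar wavelet (db1, the simplest Daubechies member); the input frame values are hypothetical:

```python
import numpy as np

def haar_dwt_level(frame):
    """One wavelet analysis level: filter the frame with the Haar low-pass and
    high-pass filters, then downsample by 2 (keep every other sample)."""
    x = np.asarray(frame, dtype=float)
    lo = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (approximation) filter
    hi = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (detail) filter
    approx = np.convolve(x, lo)[1::2]          # downsample by 2
    detail = np.convolve(x, hi)[1::2]
    return approx, detail

# Orthonormal filters preserve energy, and a constant frame has zero detail.
a, d = haar_dwt_level([1.0, 2.0, 3.0, 4.0])
```

Each level halves the number of samples per band, which is what makes the subsequent compression of the detail coefficients effective.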

Figure 5.3. Wavelet decomposition


5.1.3 Fast Fourier Transform (FFT)


The FFT is used to reduce computational complexity. It converts the voice
signal from the time domain to the frequency domain by taking the FFT of each
frame over the frame length and computing the power spectral density (PSD). The
PSD describes how the power of the signal is distributed over frequency.
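As a minimal sketch of this step (the 2048-sample frame length matches the report; the sampling rate and the 440 Hz test tone are assumptions for illustration):

```python
import numpy as np

fs = 8000                                # assumed sampling rate (Hz)
n = 2048                                 # frame length used in the report
t = np.arange(n) / fs
frame = np.sin(2 * np.pi * 440.0 * t)    # hypothetical 440 Hz test frame

spectrum = np.fft.rfft(frame)            # frequency-domain view of the frame
psd = (np.abs(spectrum) ** 2) / n        # periodogram estimate of the PSD
peak_hz = np.argmax(psd) * fs / n        # frequency bin carrying most power
```

The bin with the largest PSD value lands on the tone frequency to within one FFT bin (fs/n Hz), which is the resolution the psychoacoustic model then works with.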
5.1.4 WAVELET COMPRESSION
The decomposed signal (i.e. the frames) is compressed to a certain level, and
the amount of compression that takes place in each frame is measured. The
compressed frames are then passed to bit allocation.
5.1.5 PSYCHOACOUSTIC MODEL
The psychoacoustic model is based on the study of human perception. These
studies show that the average human does not hear all frequencies in the same way,
and the model helps to determine the perceived voice quality. The model takes the
signal in the frequency domain, because there it can remove unnecessary frequency
components from the given signal.
The two main properties of the human auditory system that make up the
psychoacoustic model are:
1. Absolute threshold of hearing
2. Auditory masking
5.1.5.1 Absolute threshold of hearing
The absolute threshold of hearing is the sound pressure level below which the
average human ear does not detect any stimulus. While compressing the signal, if
any frequency components have power levels that fall below the absolute threshold
of hearing, those components can be discarded, since the listener would be unable to
hear them anyway. Auditory masking is a perceptual property of the human auditory
system that occurs whenever the presence of a strong audio signal makes a temporal
or spectral neighborhood of a weaker audio signal imperceptible.

If two sounds occur simultaneously and one is masked by the other, this is
referred to as simultaneous masking. Similarly, a weak sound emitted soon after the
end of a louder sound is masked by the louder sound. In fact, even a weak sound just
before a louder sound can be masked by the louder sound. These two effects are
called forward and backward temporal masking, respectively.
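The report does not give the absolute threshold of hearing in closed form. A widely used approximation is Terhardt's formula; it is an assumption here, not taken from the report, and can be sketched in Python:

```python
import numpy as np

def ath_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing, in dB SPL
    (an assumed formula, not from the report). Frequency components whose power
    falls below this curve may be discarded by the coder."""
    f = np.asarray(f_hz, dtype=float) / 1000.0       # frequency in kHz
    return (3.64 * f ** (-0.8)
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

The curve is lowest, i.e. the ear is most sensitive, around 3-4 kHz, and it rises steeply toward both ends of the audible range.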

Figure 5.4. Simultaneous masking


Figure 5.5. Temporal masking


5.1.5.2 Auditory Masking
Humans do not have the ability to hear minute differences in frequency. For a
masked signal to be heard, its power level needs to be increased above a threshold
that is determined by the frequency of the masker tone and its strength. If noise is
strong enough, it can mask a tone that would otherwise be clearly audible. In a
compression algorithm, therefore, one must determine:
I. tone maskers,
II. noise maskers,
III. masking effects.

Tone maskers:
Determining whether a frequency component is a tone masker requires knowing
whether it has been held constant for a period of time, as well as whether it is a
sharp peak in the frequency spectrum, which indicates that it is above the ambient
noise of the signal.

A frequency f (with FFT index k) is a tone if its power spectrum P[k] is:
(i) greater than P[k+1] and P[k-1]; and
(ii) 7 dB greater than the other frequencies in its neighborhood, where the
neighborhood depends on f:
if 0.17 kHz < f < 5.5 kHz, the neighborhood is [k-2, k+2];
if 5.5 kHz <= f < 11 kHz, the neighborhood is [k-3, k+3];
if 11 kHz <= f < 20 kHz, the neighborhood is [k-6, k+6].
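A minimal Python sketch of this test (the PSD array below is hypothetical; the neighborhood offsets follow the frequency ranges above):

```python
import numpy as np

def tone_masker_bins(p_db, fs, nfft):
    """Return FFT bin indices that qualify as tone maskers: local peaks in the
    power spectrum (dB) exceeding their frequency-dependent neighborhood by 7 dB."""
    tones = []
    for k in range(1, len(p_db) - 1):
        if not (p_db[k] > p_db[k - 1] and p_db[k] > p_db[k + 1]):
            continue                      # not a local peak
        f = k * fs / nfft                 # bin frequency in Hz
        if f < 5500:
            offsets = [2]                 # neighborhood [k-2, k+2]
        elif f < 11000:
            offsets = [2, 3]              # neighborhood [k-3, k+3]
        else:
            offsets = [2, 3, 4, 5, 6]     # neighborhood [k-6, k+6]
        if all(0 <= k - d and k + d < len(p_db)
               and p_db[k] > p_db[k - d] + 7.0
               and p_db[k] > p_db[k + d] + 7.0
               for d in offsets):
            tones.append(k)
    return tones

# A flat -50 dB floor with a single 0 dB peak yields exactly one tone masker.
p = np.full(1025, -50.0)
p[100] = 0.0
```

Bins flagged here are treated as tonal maskers; everything left over is lumped into the noise maskers of the next step.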
Noise Maskers:
If a signal is not a tone, it must be noise. Thus, one can take all frequency
components that are not part of a tone neighborhood and treat them like noise.
Masking effects:
The maskers that have been determined affect not only the frequencies within a
critical band, but also those in surrounding bands. The frequency range from 20 Hz
to 20 kHz can be broken up into critical bandwidths, which are non-uniform,
non-linear, and dependent on the level of the incoming sound; signals within one
critical bandwidth are hard for a human observer to separate. The center frequency
location of these critical subbands is known as the critical band rate, which
approximately follows the expression

z(f) = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2)  [Bark], with f in Hz.

The distance from one critical band center to the center of the next band is
1 Bark. The human auditory frequency range spreads from 20 Hz to 20 kHz and
covers approximately 25 Barks. The general frequency response is shown below.


Figure 5.6. Critical bandwidth


5.1.6 QUANTIZATION
Quantization is the process of digitizing the analog signal. It is performed to
achieve better compression: it reduces the number of bits needed to store the
information by reducing the size of the integers that represent it. Quantization
approximates the given data by representing it with a smaller subset.

Scalar quantization quantizes each data item independently. A scalar quantizer
is a function that has a staircase characteristic and is usually designed considering
the probability distribution of the input data. The quantization levels are written into
the output stream along with the quantized data. Note that quantization always
results in lossy compression; after the quantization stage, the error introduced into
the system is impossible to remove. Vector quantization has the same aim as scalar
quantization, namely to represent the data with a smaller subset, but the quantization
is done on vectors rather than individual scalars. Vector quantization is more
effective in quantizing correlated data such as images and audio signals.

When a continuous signal is quantized, the difference between the continuous
signal and the quantized signal is an error. The μ-law is a logarithmic compressor.

It packs 16 bits of data into 8 bits. The μ-law characteristic is

    |y| = log(1 + μ|x|) / log(1 + μ)

where μ = 0 gives no compression and increasing μ gives more compression; here
μ = 255 is used.
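The companding characteristic above can be sketched in Python (a sketch only, not the report's MATLAB code):

```python
import numpy as np

def mu_law_compress(x, mu=255.0):
    """Mu-law compressor: |y| = log(1 + mu*|x|) / log(1 + mu), for x in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Inverse mapping (expander): recovers x from the companded value y."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```

Small amplitudes are boosted before uniform quantization, so fewer bits yield a roughly constant SNR across the dynamic range; the expander undoes the mapping exactly.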

Figure 5.7. Quantizing the signal

Data vectors: the input data that have to be vector quantized.

Code vectors (codes or codewords): a code vector is a representative of a set of
data vectors. The idea is that several similar data vectors are approximated by the
same codeword; this is how vector quantization approximates data.

Codebook: the set of all codes is called a codebook. The codebook contains all
the codes that appear in the vector-quantized data.

Encoding region: the encoding region for a particular codeword is the set of
data vectors represented by that codeword. Encoding regions are mutually disjoint.

Partition: the set of all encoding regions is called a partition. After the vector
quantization process, we obtain a partition of the input space.
The algorithm requires an initial codebook C(0). This initial codebook is
obtained by the splitting method. In this method, an initial code vector is set as the
average of the entire training sequence. This code vector is then split into two. The
iterative algorithm is run with these two vectors as the initial codebook. The final two
code vectors are split into four and the process is repeated until the desired number of
code vectors is obtained.
The vector quantizer thus returns the codebook C and the partition P. After
this, every vector in the original data is replaced by the index of its corresponding
codeword in the codebook. Since the number of distinct codewords is smaller than
the number of distinct data vectors, replacing them by their indices reduces the size
of the representation. The codebook and the indices are written into the output
stream.
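The splitting method described above (the LBG algorithm) can be sketched in Python; the two-cluster training data used below is hypothetical:

```python
import numpy as np

def lbg(data, n_codes, n_iter=10, eps=1e-3):
    """LBG vector quantizer design by splitting: start from the mean of the
    training data, split every code vector in two, then refine the codebook with
    nearest-neighbour assignment and centroid (encoding-region mean) updates."""
    codebook = data.mean(axis=0, keepdims=True)          # initial code vector
    while len(codebook) < n_codes:
        # split each code vector into a slightly perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            dist = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            idx = dist.argmin(axis=1)                    # partition assignment
            for c in range(len(codebook)):
                if np.any(idx == c):                     # centroid update
                    codebook[c] = data[idx == c].mean(axis=0)
    dist = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook, dist.argmin(axis=1)                 # codebook C, partition P
```

Doubling the codebook each pass means the desired number of code vectors should be a power of two, which matches the splitting procedure described in the text.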

5.1.7 BIT ALLOCATION

The next process is bit allocation. Bit allocation proceeds with a fixed number
of iterations of the zerotree algorithm before a perceptual evaluation is done.
The zerotree algorithm organizes the coefficients in a tree structure that is
temporally aligned from coarse to fine, and tries to exploit the remnants of
temporal structure that exist in the wavelet packet coefficients. Bit allocation
takes its input from the psychoacoustic model and the wavelet compression
scheme.
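As a rough illustration of the coarse-to-fine allocation idea, the sketch below runs successive-approximation significance passes over a list of coefficients ordered coarse to fine. It is a strong simplification of zerotree coding (no tree relations are tracked), and the coefficient values are invented.

```python
# Highly simplified sketch of successive-threshold significance passes, in the
# spirit of zerotree bit allocation: coefficients are scanned coarse-to-fine
# and bits are spent first on the ones significant at the current threshold.

def significance_passes(coeffs, n_passes=3):
    """Return, per pass, (threshold, indices that become significant)."""
    t = max(abs(c) for c in coeffs) / 2.0   # initial threshold
    emitted = set()
    plan = []
    for _ in range(n_passes):
        newly = [i for i, c in enumerate(coeffs)
                 if abs(c) >= t and i not in emitted]
        emitted.update(newly)
        plan.append((t, newly))
        t /= 2.0                            # refine: halve the threshold
    return plan

# Coefficients ordered coarse-to-fine (approximation first, then details).
coeffs = [8.0, -5.5, 3.1, 0.4, -0.2, 1.9]
plan = significance_passes(coeffs)
```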


5.2 DECODING PROCESS

The figure below shows the block diagram of the decoding process. The
explanation of each block is given below.

Figure 4. Block diagram of Decoder

5.2.1 WAVELET RECONSTRUCTION

The wavelet reconstruction process is done by upsampling; it is the process of
combining the frames. Upsampling increases the sampling rate of a signal by a
factor of 2. The sampling rate of a signal can be increased by placing equally
spaced zeros between each pair of samples.
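Zero insertion as described above can be sketched in a few lines; the sample values here are illustrative.

```python
def upsample_by_2(samples):
    """Insert a zero after each sample, doubling the sampling rate."""
    out = []
    for s in samples:
        out.extend([s, 0.0])
    return out

doubled = upsample_by_2([1.0, 2.0, 3.0])  # zeros interleaved between samples
```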


Figure 10. Reconstruction of signal

5.2.2 Writing to .wav file format

The reconstructed frames are converted back into the time domain, i.e. the
waveform of the analog signal, and written to a specific folder in .wav file
format.
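A minimal sketch of writing such frames as 16-bit PCM .wav data, using Python's standard wave module (the report's implementation is in MATLAB, so the scaling convention and the 8 kHz sampling rate here are assumptions):

```python
# Sketch: scale float samples in [-1, 1] to 16-bit PCM and emit a .wav stream.
# Written to an in-memory buffer; a real coder would write to a file path.
import io
import struct
import wave

def frames_to_wav_bytes(frames, rate=8000):
    """Pack float samples as mono 16-bit PCM in RIFF/WAVE format."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono voice signal
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)   # assumed sampling rate
        pcm = b"".join(struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
                       for s in frames)
        w.writeframes(pcm)
    return buf.getvalue()

wav = frames_to_wav_bytes([0.0, 0.5, -0.5, 0.25])
```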


8. ALGORITHM FOR VSC

Step 1: Load the recorded voice signal
Step 2: Select the wavelet family
Step 3: Choose the frame size
Step 4: Sample the voice signal
Step 5: Decompose the sampled voice signal
Step 6: Compute the threshold value of the signal
Step 7: Compress the decomposed signal
Step 8: Calculate the PSD using a length-2048 FFT
Step 9: Calculate the power of the tone masker
Step 10: If the selected voice signal is a tone, then go to the next step; otherwise exit
Step 11: Calculate the energy of the signal
Step 12: Calculate the entropy of the signal
Step 13: Calculate the SNR of the signal
Step 14: Calculate the number of bits required for quantization
Step 15: Compute the offset to shift the memory location of the entire partition
Step 16: Reconstruct the compressed signal based on multilevel decomposition
Step 17: Evaluate the distortion between the reconstructed and the original
signal
Step 18: Write the reconstructed voice signal to a .wav file
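Steps 5 to 7 and Step 16 above can be illustrated with a toy one-level decomposition. The report uses the Daubechies wavelet family in MATLAB; the Haar wavelet is used below only because it fits in a few lines, and the signal values and threshold are invented.

```python
# Toy sketch: one-level Haar decomposition, hard thresholding of the detail
# coefficients (the compression step), and reconstruction. Illustrative only.
import math

def haar_decompose(x):
    """Split an even-length signal into approximation and detail coefficients."""
    a = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    d = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    return a, d

def threshold(coeffs, t):
    """Hard thresholding: zero out coefficients below the threshold."""
    return [c if abs(c) >= t else 0.0 for c in coeffs]

def haar_reconstruct(a, d):
    """Invert haar_decompose; exact when no coefficients were discarded."""
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / math.sqrt(2))
        x.append((ai - di) / math.sqrt(2))
    return x

signal = [1.0, 1.1, 0.9, 1.0, 4.0, 4.2, 1.0, 0.9]
a, d = haar_decompose(signal)
d_c = threshold(d, 0.2)          # small detail coefficients become zero
rec = haar_reconstruct(a, d_c)   # approximate reconstruction of the signal
```

With the threshold removed, reconstruction is exact; thresholding trades a small distortion for the many zero coefficients that make the stream compressible.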


9. FLOW CHART FOR VSC

[Flowchart: the stages below are executed in sequence]

START
LOAD THE RECORDED VOICE SIGNAL
CHOOSE THE WAVELET FAMILY
CHOOSE THE FRAME SIZE
SAMPLE THE RECORDED VOICE SIGNAL
DECOMPOSE THE VOICE SIGNAL
COMPRESS THE WAVELET DECOMPOSED FRAMES
FIND THE PSD OF THE SIGNAL (PSYCHOACOUSTIC MODEL)
FIND THE POWER OF THE TONE MASKER, SIGNAL ENERGY, ENTROPY & SNR
FIND THE NUMBER OF BITS REQUIRED FOR QUANTIZATION
BIT ALLOCATION
HEADER INSERTION
STREAM OF DATA
HEADER EXTRACTION
WAVELET RECONSTRUCTION
OUTPUT (STORED IN .WAV WAREHOUSE)
STOP

10. RESULT AND DISCUSSION

The program is implemented in MATLAB and the simulation is carried out. The
results are analysed for the following samples:
[Figures: original and reconstructed audio signals, amplitude versus time in
seconds, for Sample 1, Sample 2, Sample 3, Sample 4 and Sample 5]

Figure 10.1 Result analysis for voice signal compression using Samples 1, 2, 3,
4 and 5

Figure 10.2 Graphs of memory size vs. SNR and memory size vs. entropy

Table 1. Result analysis for voice signal compression

Samples    Memory before       Level   SNR     Entropy   Distortion   Memory after
           compression                 (dB)    (bits)    analysis     compression
Sample 1   10 MB               8       49.19   6.5774    0.0481       5.04 MB
Sample 2   3.56 MB             6       45.81   6.8191    0.0241       1.78 MB
Sample 3   734 KB              5       36.03   5.3281    0.0050       367 KB
Sample 4   500 KB              7       34.56   8.6960    0.0029       250 KB
Sample 5   294 KB              5       42.13   8.3211    0.0098       147 KB

DISCUSSION
In this VSC, we took the five samples listed in the table above. We noticed
that the memory size of each voice signal is compressed to approximately 50%
of its original size. For example, in Sample 4 the memory size of 500 KB is
compressed to 250 KB; the signal-to-noise ratio, entropy and distortion are
then evaluated between the compressed voice signal and the original voice
signal. Even when the decomposition level was changed, the memory size was
still compressed by approximately 50%. The signal-to-noise ratio and entropy
are measured in dB and bits respectively.
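The figures of merit in Table 1 could be computed as sketched below. The exact formulas used in the report's MATLAB code are not shown, so these are standard textbook definitions, an assumption on our part: SNR in dB between the original and reconstructed signals, Shannon entropy in bits of a symbol sequence, and distortion as mean squared error. The sample values are invented.

```python
# Sketch of standard quality metrics; assumed to match the report's usage.
import math
from collections import Counter

def snr_db(original, reconstructed):
    """Signal-to-noise ratio in dB: signal power over error power."""
    signal_power = sum(s * s for s in original)
    noise_power = sum((s - r) ** 2 for s, r in zip(original, reconstructed))
    return 10 * math.log10(signal_power / noise_power)

def entropy_bits(symbols):
    """Shannon entropy in bits of a sequence of discrete symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def distortion(original, reconstructed):
    """Mean squared error between original and reconstructed signals."""
    return (sum((s - r) ** 2 for s, r in zip(original, reconstructed))
            / len(original))

orig = [0.5, -0.25, 0.75, -0.5]
rec = [0.45, -0.2, 0.7, -0.45]
```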


11. CONCLUSION AND FUTURE WORK

11.1 CONCLUSION
On the basis of the Daubechies wavelet family, wavelet decomposition and the
psychoacoustic model, compression is achieved to approximately 48% to 50% of
the source file. Throughout this work it is found that every recorded voice
contains energy that yields the average information (entropy) for each frame
of recorded voice, and the SNR is computed from the entropy. During the
processing of the voice signal, a frame size of 2048 is used for decomposition;
after finding the threshold value, the voice signal is compressed, and the
compressed signal is reconstructed to obtain a better compression result. This
work is useful for limited storage devices and global transmission media.

11.2 FUTURE WORK

From the collected data, the finding is that there is a long way to go to
compete with other compression schemes such as MP3, MP4 and AVI. The
compression ratio needs to be increased, which is under development. Analysis
of different parameters (i.e. single echo, multiple echo, fade-in, fade-out,
pitch low, pitch high, FIR reverb, IIR reverb, flat response, etc.) of the
compressed voice signal is also possible for the purpose of security or voice
signal identification, which is under research.


12. REFERENCES

[1] Fred Halsall, Multimedia Communications, Pearson Education, Third Indian
Reprint, 2003.
[2] D. Sinha and A. Tewfik, "Low Bit Rate Transparent Audio Compression using
Adapted Wavelets," IEEE Trans. ASSP, Vol. 41, No. 12, December 1993.
[3] J.I. Agbinya, "Discrete Wavelet Transform Techniques in Speech Processing,"
IEEE TENCON Digital Signal Processing Applications Proceedings, IEEE, New York,
NY, 1996, pp. 514-519.
[4] Ken C. Pohlmann, Principles of Digital Audio, McGraw-Hill, Fourth Edition,
2000.
[5] X. Huang, A. Acero and H.-W. Hon, Spoken Language Processing: A Guide to
Theory, Algorithm and System Development, Pearson Education, First Edition,
2001.
[6] S.G. Mallat, A Wavelet Tour of Signal Processing, Second Edition, Academic
Press, 1999. ISBN 0-12-466606-X.
[7] J.G. Proakis and D.G. Manolakis, Digital Signal Processing: Principles,
Algorithms, and Applications, Prentice-Hall, NJ, Third Edition, 1996.
[8] Mathworks, Student Edition of MATLAB, Version 6.5, Prentice-Hall, NJ.

Websites:
http://www.m4a.com/
http://www.vialicensing.com/products/mpeg4aac/standard.html
http://is.rice.edu/%7Ewelsh/elec431/index.html
http://perso.wanadoo.fr/polyvalens/clemens/wavelets/wavelets.html
http://www.mp3developments.com/article4.php

