AUDIO STEGANOGRAPHY FOR COVERT DATA

AUDIO STEGANOGRAPHY FOR COVERT DATA TRANSMISSION BY
IMPERCEPTIBLE TONE INSERTION

Kaliappan Gopalan (1) and Stanley Wenndt (2)
(1) Department of Engineering, Purdue University Calumet, Hammond, IN 46323
gopalan@calumet.purdue.edu
(2) Multi-Sensor Exploitation Branch, AFRL/IFEC, Rome, NY 13441
Stanley.Wenndt@rl.af.mil
ABSTRACT
This paper presents the technique of embedding data in
an audio signal by inserting low power tones and its
robustness to noise and cropping of embedded speech
samples. Experiments on the embedding procedure applied
to cover audio utterances from the noise-free TIMIT database
and a noisy database demonstrate the feasibility of the
technique in terms of imperceptible embedding, high data
rate and accurate data recovery. The low power levels
ensure that the tones are inaudible in the message-embedded
stego signal. Besides imperceptibility in hearing, the
spectrogram of the stego signal also conceals the existence
of embedded information. Both of these features render the
detection of embedding in the stego signal difficult to
accomplish. Oblivious detection of the stego signal, instead
of escrow detection, yields the embedded information
accurately. In addition, results of two cases of attacks on the
data-embedded stego audio, namely, additive noise and
random cropping, show the technique is robust for covert
communication and steganography.
Keywords: Audio Steganography, Imperceptible tone insertion
1. INTRODUCTION
Covert communication by embedding a message or data
file in a cover medium has been increasingly gaining
importance in the all-encompassing field of information
technology. Audio steganography is concerned with
embedding information in an innocuous cover speech in a
secure and robust manner. Communication and
transmission security and robustness are essential for
transmitting vital information to intended recipients while
denying access to unauthorized persons. By hiding the
information using a cover or host audio as a wrapper, the
existence of the information is concealed during
transmission. This is critical in applications such as
battlefield communications and bank transactions, for
example.
Steganography, in general, relies on the imperfection of
the human auditory and visual systems. Audio
steganography takes advantage of the psychoacoustical
masking phenomenon of the human auditory system (HAS).
The psychoacoustical, or auditory, masking property renders a
weak tone imperceptible in the presence of a strong tone in
its temporal or spectral neighborhood. This property arises
because of the low differential range of the HAS even
though the dynamic range covers 80 dB below ambient
level [1, 2]. Frequency masking occurs when the human ear
cannot perceive frequencies at a lower power level if these
frequencies are present in the vicinity of tone- or noise-like
frequencies at a higher level. Additionally, a weak pure tone
is masked by wide-band noise if the tone occurs within a
critical band. This property of inaudibility of weaker
sounds is used in different ways for embedding information.
Embedding of data by inserting inaudible tones in a cover
audio signal has been presented recently [3, 4]. The
following sections describe the tone insertion technique and
its robustness in retrieving the embedded information in the
presence of noise and in cropped frames of received speech.
2. EMBEDDING BY TONE INSERTION

The tone insertion method relies on the inaudibility of
low power tones in the presence of significantly higher
spectral components, an indirect exploitation of the
psychoacoustic masking phenomenon in the spectral
domain. Experiments were conducted using utterances from
(a) the TIMIT database, and (b) the Greenflag database
consisting of noisy recordings of air traffic controllers, as
host or cover audio samples.
In the first experiment, two tones at frequencies f0 and f1
are generated for embedding bit 0 and bit 1, respectively. As
seen in Fig. 1, the host audio is divided into nonoverlapping
segments of 16 ms in duration. For the host utterances used,
f0 is set at 1875 Hz and f1 at 2625 Hz arbitrarily. For every
frame of host audio, the frame power, fe, is computed and
only one bit of data is embedded into the host audio frame.
If the bit to be embedded is a 0, then the power of f0 is set
at 0.25 percent of fe and the power of f1 is set at 0.001 of
that of f0. To embed a bit of 1, the power of f1 is set at
0.25 percent of fe and the power of f0 is set at 0.001 of the
power of f1. The simultaneous setting of significant and
extremely low powers to the tones facilitates concealed
embedding and correct detection of data. The low and
relatively high power ratios avoid one or both of the tones
being detected in hearing or spectrogram; if only one of the
tones is set to a fixed power ratio relative to the frame
power, the other tone may be noticeable in cases where the
host frame inherently has a substantial component at the
tone frequency. The second advantage is that a known
high/low ratio of power between the tones facilitates the
detection of the embedded bit even when the embedded
amplitudes are scaled or quantized. The frames with their
spectral components at the tone frequencies set in
accordance with the data bits constituted the stego signal.
For transmission, the embedded frame is quantized to 16
bits, which is the same as the original host audio sample
size.

Fig. 1 Tone insertion algorithm (flowchart: the host audio is
segmented into 16 ms frames and the frame power fe is
computed; if the bit to embed is 0, the power of f0 is set to
0.25% of fe and the power of f1 to 0.001 of that of f0;
otherwise the power of f1 is set to 0.25% of fe and the power
of f0 to 0.001 of that of f1; the frame is then quantized to
16 bits and transmitted)

For recovering the covert information from every
received frame of audio, the frame power fe is computed
along with the powers p0 and p1 at f0 and f1. If the ratio
(fe/p0) < (fe/p1), that is, if the tone at f0 carries the higher
power, then the covert bit is declared a 0. Otherwise,
the covert bit embedded in the frame is considered a 1.

Embedding Two Bits Per Audio Frame

The second experiment extended the technique to double
the payload capacity by using four tones. For this
experiment, one tone out of a selection of four tones,
namely, 750 Hz, 1250 Hz, 1875 Hz, and 2625 Hz, was
set to 0.25 percent of the average power of each frame while
the other tones were set to negligible values. For detection,
the ratio of frame power to power at each tone was used.
This ratio, clearly, is a minimum for the tone that was set at
0.25 percent of the frame power.
To add further security in transmission, a 4-bit key for
each frame was used at the transmitter to determine which
one of the four tones would be set at high power relative to
the other three tones. The embedded two-bit combination
from each frame was recovered by using the same key at the
receiver and detecting the relative power ratio of each tone
and the frame. The results of these experiments with noise
and cropping are presented in the next section.

3. EXPERIMENTAL RESULTS

In the first experiment using the TIMIT utterance, "She
had your dark suit in greasy wash water all year," which is
available as 16-bit samples at the rate of 16,000 per second,
nonoverlapped frames of 256 samples were used to embed
one bit in each. With 208 frames, 208 bits of random data
were embedded by inserting tones at 1875 Hz and 2625 Hz
with appropriate powers, and the tone-inserted stego signal
was quantized to 16 bits for transmission. From the stego
signal, all 208 bits were successfully recovered from the
ratios of frame power to power at the tone frequencies. From
informal listening tests and from the spectrograms, the stego
signal was found to be indistinguishable from the unembedded
host audio signal.
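The one-bit-per-frame scheme can be sketched as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions not stated in the paper: the tones are inserted by overwriting the corresponding FFT bins (both tone frequencies fall exactly on bins 30 and 42 of a 256-point transform at 16,000 samples/s), the host phase at each bin is kept, and a synthetic Gaussian signal stands in for the TIMIT utterance. All function and constant names are hypothetical.

```python
import numpy as np

FS = 16_000               # sampling rate in Hz (as for the TIMIT host)
N = 256                   # samples per 16 ms frame
F0, F1 = 1875.0, 2625.0   # tones for bit 0 and bit 1
STRONG = 0.0025           # dominant tone power: 0.25 percent of frame power
WEAK = 0.001              # other tone power, relative to the dominant tone

K0 = int(round(F0 * N / FS))   # FFT bin of f0
K1 = int(round(F1 * N / FS))   # FFT bin of f1


def set_tone_power(spectrum, k, power):
    """Overwrite bin k with an on-bin sinusoid of the given mean-square power."""
    amplitude = np.sqrt(2.0 * power)    # a sinusoid of amplitude A has power A^2 / 2
    phase = np.angle(spectrum[k])       # keep the host phase (assumption)
    spectrum[k] = (amplitude * N / 2.0) * np.exp(1j * phase)


def embed(host, bits):
    """Embed one bit per non-overlapping frame, then quantize to 16 bits."""
    stego = np.asarray(host, dtype=np.float64).copy()
    for i, bit in enumerate(bits):
        frame = stego[i * N:(i + 1) * N]
        fe = np.mean(frame ** 2)                        # frame power
        spectrum = np.fft.rfft(frame)
        strong, weak = (K1, K0) if bit else (K0, K1)
        set_tone_power(spectrum, strong, STRONG * fe)
        set_tone_power(spectrum, weak, WEAK * STRONG * fe)
        stego[i * N:(i + 1) * N] = np.fft.irfft(spectrum, n=N)
    return np.clip(np.round(stego), -32768, 32767).astype(np.int16)


def recover(stego, n_bits):
    """Declare bit 0 when the tone at f0 is stronger, i.e. fe/p0 < fe/p1."""
    bits = []
    for i in range(n_bits):
        frame = stego[i * N:(i + 1) * N].astype(np.float64)
        spectrum = np.fft.rfft(frame)
        p0 = np.abs(spectrum[K0]) ** 2
        p1 = np.abs(spectrum[K1]) ** 2
        bits.append(0 if p0 > p1 else 1)
    return np.array(bits)


rng = np.random.default_rng(0)
host = rng.normal(0.0, 1000.0, 208 * N)   # synthetic stand-in for a host utterance
bits = rng.integers(0, 2, 208)
stego = embed(host, bits)
print("bit errors:", int(np.sum(recover(stego, len(bits)) != bits)))
```

Because both tones are set in every frame, the receiver only needs to compare the two tone powers; it does not need the host signal or the absolute power levels.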
In the second experiment, four tones were used to embed
two bits in each frame. In addition, successive frames for
embedding were overlapped by 50 percent to further
increase the payload capacity. After verifying the
imperceptibility of and the data recovery from the stego
signal, the technique was extended for use in covert
battlefield communication in which the hidden information
can be another utterance. For initial studies, the utterance
"seven one," spoken by a male speaker, was used as the
covert message. This utterance was represented in the
Global System for Mobile communication half-rate (GSM
06.20) coding scheme, resulting in a compact form of 2800
bits. Two TIMIT utterances (each with 16-bit samples at
16,000 samples/s) were concatenated to accommodate the
large covert information size. With two bits inserted in each
host frame of 256 samples, only the first 1400 overlapped
frames out of a total of 1542 were used for embedding all
the covert message bits. This gives an embedding capacity
of 2800 bits in 11.208 s, or 249.82 bits/s.
Tones for insertion were selected at frequencies of 687.5
Hz, 1187.5 Hz, 1812.5 Hz, and 2562.5 Hz. These
frequencies were either absent or weak in the host frames.
With four tones, however, an additional step was
necessitated to prevent the detection of embedding.
Presence of a continuous stream of 0s or 1s in the covert
data, for instance, results in the same tone being set at 0.25
percent of the corresponding frame power. Although a
listener may not be able to perceive the tone because of its
low power, the spectrogram is likely to show continuous
spectral nulls or holes at the remaining three tone
frequencies. To a malicious attacker, these artifacts are
indicative of host manipulation even without the knowledge
of the host spectrogram. Use of a 4-bit key that sets the order
of the tones for each frame by frequency hopping avoids such
an obvious detection of embedding [3].
Figs. 2 and 3 show the host and the stego signals and
their spectrograms using the frequency-hopped four-tone
insertion for embedding the covert message. No perceptual
or otherwise detectable difference was noticed between the
host and the stego signals, and all the embedded data were
correctly recovered from the stego signal.

Fig. 2 Host (top) and stego with 2800 bits embedded
(bottom) [waveform panels "Host - TIMIT" and "Stego - 2
bits/frame"; horizontal axis: sample index]

Fig. 3 Spectrograms of host (top) and stego with 2800 bits
embedded (bottom) [axes: time, frequency from 0 to 8000 Hz]

To study the effect of noise on the recovery of embedded
information, Gaussian noise with zero mean and average
power proportional to the frame power of the embedded stego
was added to each frame. Random noise at low power is
unlikely to increase the power of any of the three tones to a
level that exceeds the power level of the significant tone.
Hence, the pair of embedded bits from each of the
noise-added stego frames was successfully recovered up to noise
power set to 25 percent of frame power. Bit errors started
showing up at higher noise power levels. However, speech
decoded from GSM-coded speech with up to 10 percent bit
errors results in sufficient quality to convey the message,
albeit with noise. Hence, noise power levels as high as 5
percent of frame power, which resulted in only 70 to 80 bits
in error out of a total of 2800 bits, can be used to transmit
covert messages in encoded form with the tone embedding
technique.
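A minimal sketch of the frequency-hopped four-tone insertion and of the additive-noise test follows. It is an illustration under assumptions that the paper does not spell out: the per-frame 4-bit key is assumed to rotate the tone order (the dominant tone index is the two-bit symbol plus the key, modulo 4), the three non-dominant tones are set to 0.001 of the dominant tone power, frames are taken without overlap for simplicity, and a synthetic Gaussian host replaces the TIMIT utterances. All names are hypothetical.

```python
import numpy as np

FS, N = 16_000, 256
TONES_HZ = (687.5, 1187.5, 1812.5, 2562.5)
BINS = [int(round(f * N / FS)) for f in TONES_HZ]   # each tone falls on an FFT bin
STRONG, WEAK = 0.0025, 0.001


def tone_magnitude(power):
    """rfft magnitude of an on-bin sinusoid with the given mean-square power."""
    return np.sqrt(2.0 * power) * N / 2.0


def embed(host, symbols, keys):
    """Embed one two-bit symbol (0..3) per frame; the key hops the tone order."""
    stego = np.asarray(host, dtype=np.float64).copy()
    for i, (sym, key) in enumerate(zip(symbols, keys)):
        frame = stego[i * N:(i + 1) * N]
        fe = np.mean(frame ** 2)                      # frame power
        spec = np.fft.rfft(frame)
        dominant = (int(sym) + int(key)) % 4          # assumed key mapping
        for t, k in enumerate(BINS):
            power = STRONG * fe if t == dominant else WEAK * STRONG * fe
            spec[k] = tone_magnitude(power) * np.exp(1j * np.angle(spec[k]))
        stego[i * N:(i + 1) * N] = np.fft.irfft(spec, n=N)
    return np.clip(np.round(stego), -32768, 32767).astype(np.int16)


def recover(stego, keys):
    """Pick the strongest tone (minimum frame-to-tone power ratio), undo the key."""
    symbols = []
    for i, key in enumerate(keys):
        spec = np.fft.rfft(np.asarray(stego[i * N:(i + 1) * N], dtype=np.float64))
        dominant = int(np.argmax([np.abs(spec[k]) ** 2 for k in BINS]))
        symbols.append((dominant - int(key)) % 4)
    return np.array(symbols)


def add_noise(stego, fraction, rng):
    """Add zero-mean Gaussian noise with power equal to a fraction of frame power."""
    noisy = np.asarray(stego, dtype=np.float64).copy()
    for i in range(len(noisy) // N):
        frame = noisy[i * N:(i + 1) * N]              # view into noisy
        sigma = np.sqrt(fraction * np.mean(frame ** 2))
        frame += rng.normal(0.0, sigma, N)
    return noisy


rng = np.random.default_rng(1)
n_frames = 100
host = rng.normal(0.0, 1000.0, n_frames * N)          # synthetic stand-in host
symbols = rng.integers(0, 4, n_frames)                # two bits per frame
keys = rng.integers(0, 16, n_frames)                  # 4-bit key per frame
stego = embed(host, symbols, keys)
for fraction in (0.001, 0.05, 0.25):
    noisy = add_noise(stego, fraction, rng)
    errors = int(np.sum(recover(noisy, keys) != symbols))
    print(f"noise at {fraction:.1%} of frame power: {errors} symbol errors")
```

With a white synthetic host the error counts will not match those reported for speech, since the host spectrum determines how much noise power lands on the tone bins; the sketch only shows the mechanism.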
A more serious attack on the stego during transmission
than additive noise is the random deletion of a few samples.
Since removal of up to one in 50 samples has been shown to
cause no perceptible difference [5], we studied its effect on
data recovery. With the removal of five samples from each
stego frame of 256 samples and their replacement with zeros,
4 to 10 bit errors out of 2800 were observed. When 10
samples in each stego frame were replaced with zeros, the
number of bit errors increased to between 52 and 62. When
the removed samples were replaced by their neighbors, the
number of errors increased to about 30 for five-sample
replacement and to about 80 for 10-sample replacement.
Still, as noted previously, a malicious
attack on the stego may render the host message noisy while
still carrying the coded covert audio message in a
perceivable form.
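The two cropping variants described above can be sketched as a small attack simulator. This is an illustrative sketch only: the random choice of sample positions, the use of the preceding sample as the "neighbor," and the non-overlapping partition into frames are assumptions, and the function names are hypothetical.

```python
import numpy as np


def crop_frame(frame, m, rng, fill="zero"):
    """Destroy m randomly chosen samples of one frame without changing its length."""
    original = np.asarray(frame, dtype=np.float64)
    out = original.copy()
    idx = rng.choice(len(out), size=m, replace=False)
    if fill == "zero":
        out[idx] = 0.0                                  # replace with zeros
    elif fill == "neighbor":
        src = np.where(idx > 0, idx - 1, idx + 1)       # preceding sample, or next at index 0
        out[idx] = original[src]
    else:
        raise ValueError("fill must be 'zero' or 'neighbor'")
    return out, idx


def crop_signal(stego, m, rng, frame_len=256, fill="zero"):
    """Apply crop_frame to every complete frame of a stego signal."""
    out = np.asarray(stego, dtype=np.float64).copy()
    for start in range(0, len(out) - frame_len + 1, frame_len):
        cropped, _ = crop_frame(out[start:start + frame_len], m, rng, fill)
        out[start:start + frame_len] = cropped
    return out
```

Feeding the output of `crop_signal` to a tone-power detector then gives the bit error count for a chosen number of destroyed samples per frame.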
In addition to the clean host from the TIMIT database,
the frequency-hopped tone insertion was also used in an
experiment with a noisy host from the Greenflag database.
Obtained as 16-bit PCM data at a rate of 8000 samples per
second, the Greenflag database consists of utterances from
the cockpit of fighter aircraft. Because of the high level of
noise inherent in the host, externally introduced tone or
noise arising from embedding is generally not noticeable.
Figs. 4 and 5 show the result of embedding the GSM-coded
covert speech, "seven one," on a host consisting of two
utterances from the Greenflag database. Using 128 samples
per frame, the host of 80128 samples has 1251 frames, which
can embed only 2502 bits out of the 2800 bits of the coded
covert speech. This gives an embedding capacity of 249.8
bits/s.

Fig. 4 Greenflag utterance host (top) and GSM-coded bits
embedded stego [waveform panels, "Host - GF"; horizontal
axis: sample index]

Because of the high level of intrinsic noise in the host,
the dominant tone power was raised to more than 10
percent of the frame power. Although the stego signal did not
show any perceptual difference from the host, the higher tone
power started showing up in the spectrogram. To mask the
dominant tone in the spectrogram, the tones were set to
frequencies in the range where the host has significant
energy. In the 400 Hz to 1000 Hz range, for example, the
host has relatively high spectral energy over almost the
entire duration. Hence, inserting tones at frequencies of 625
Hz, 750 Hz, 875 Hz and 1062.5 Hz, with as much as 25
percent of frame power in the dominant tone, did not result
in any noticeable difference in speech quality or spectrogram;
the inserted tones, randomized because of the hopping key,
were clearly masked by the already significant spectral
components in the host. Fig. 5 shows the spectrograms of the
host and stego for comparison.

Fig. 5 Spectrograms of Greenflag host (top) and stego with
2502 bits (bottom) [axes: time, frequency from 0 to 4000 Hz]

With additive noise raised up to each stego frame power,
all 2502 bits were correctly recovered. At higher noise
levels of up to 2.5 times the frame power, the number of bit
errors was below 80.
Cropping by zeroing or replacing from 3 to 50 samples
in each stego frame caused no bit error due to the relatively
higher power of the inserted tones. At a higher number of
destroyed samples, the stego became highly noisy; still, the
bit error was negligible.
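The embedding capacities quoted for the two hosts can be checked with a short calculation. The sketch below assumes two bits per frame and 50 percent frame overlap in both cases; the overlap for the Greenflag host is an inference (1251 frames of 128 samples with a 64-sample hop span exactly 80128 samples), not something the text states. The function name is hypothetical.

```python
def capacity(n_frames, frame_len, fs, bits_per_frame=2, overlap=0.5):
    """Return (bits, duration in seconds, bits per second) for overlapped frames."""
    hop = int(frame_len * (1.0 - overlap))          # frame advance in samples
    n_samples = (n_frames - 1) * hop + frame_len    # samples spanned by the frames
    duration = n_samples / fs
    bits = n_frames * bits_per_frame
    return bits, duration, bits / duration


# TIMIT host: 1400 overlapped frames of 256 samples at 16,000 samples/s
print(capacity(1400, 256, 16_000))   # paper reports 2800 bits in 11.208 s, 249.82 bits/s

# Greenflag host: 1251 frames of 128 samples at 8000 samples/s
print(capacity(1251, 128, 8_000))    # paper reports 2502 bits, 249.8 bits/s
```

Both hosts give nearly the same rate because the frame advance is 8 ms in each case (128 samples at 16,000 samples/s and 64 samples at 8000 samples/s), so about 125 frames, or 250 bits, are embedded per second.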
4. DISCUSSION AND CONCLUSION

The results of the present experiments clearly
demonstrate the feasibility of the proposed technique for
audio steganography with imperceptibility, payload and data
recovery. At the embedding rate of approximately 250
bits/s, the technique has a high payload capacity. Any
attempt to increase the capacity further must use more than
four tones. However, use of eight tones for embedding 3
bits/frame, for example, may lead to audible and/or visible
artifacts unless the selected tones are absent in the host
audio. Also, noise, intentional or unintentional, may
cause high bit errors at high capacity. Noisy host signals, on
the other hand, can have larger payload and use higher
power without significant loss of data. In general, tones
selected from high energy regions of the host can be masked
in hearing and spectrogram by their low power levels.
Malicious attacks involving replacement of embedded
samples with zeros or neighboring values appear to cause
less loss of data at a smaller number of samples. Cropping of
a large number of samples destroys the cover audio,
however.
REFERENCES
[1] W. Bender, D. Gruhl, N. Morimoto, and A. Lu,
"Techniques for data hiding," IBM Systems Journal, Vol.
35, Nos. 3 & 4, pp. 313-336, 1996.
[2] M.D. Swanson, M. Kobayashi, and A.H. Tewfik,
"Multimedia data-embedding and watermarking
technologies," Proc. IEEE, Vol. 86, pp. 1064-1087, June
1998.
[3] K. Gopalan, S. Wenndt, A. Noga, D. Haddad, and S.
Adams, "Covert Speech Communication Via Cover
Speech By Tone Insertion," Proc. of the 2003 IEEE
Aerospace Conference, Big Sky, MT, Mar. 2003 (on
CD).
[4] K. Gopalan, et al., "Covert Speech Communication Via
Cover Speech By Tone Insertion," U.S. Patent applied
for, Oct. 2003.
[5] R.J. Anderson and F.A.P. Petitcolas, "On the limits of
steganography," IEEE J. Selected Areas in
Communications, Vol. 16, No. 4, pp. 474-481, May 1998.
