MMC Unit-4
AV Compression
Instructional Objectives: At the end of the unit, the students should be able to:
1. Understand the various audio compression techniques.
2. Distinguish the merits of various audio compression techniques.
3. Learn the various video compression techniques.
4. Distinguish the merits of various video compression techniques.
1. INTRODUCTION:
An audio (sound) wave is a one-dimensional acoustic (pressure) wave. When an acoustic wave enters the ear, the eardrum vibrates, causing the tiny bones of the inner ear to vibrate along with it, sending nerve pulses to the brain. These pulses are perceived as sound by the listener. In a similar way, when an acoustic wave strikes a microphone, the microphone generates an electrical signal, representing the sound amplitude as a function of time. The representation, processing, storage, and transmission of such audio signals are a major part of the study of multimedia systems. The frequency range of the human ear runs from 20 Hz to 20,000 Hz. Some animals, notably dogs, can hear higher frequencies.
The ear hears logarithmically, so the ratio of two sounds with power A and B is conventionally expressed in decibels (dB) according to the formula dB = 10 log10(A/B). If we define the lower limit of audibility as 0 dB, an ordinary conversation is about 50 dB and the pain threshold is about 120 dB, a dynamic range of a factor of 1 million. The ear is surprisingly sensitive to sound variations lasting only a few milliseconds. The eye, in contrast, does not notice changes in light level that last only a few milliseconds.
The result of this observation is that jitter of only a few milliseconds during a multimedia transmission affects the perceived sound quality more than it affects the perceived image quality. Audio waves can be converted to digital form by an ADC (Analog-to-Digital Converter).
An ADC takes an electrical voltage as input and generates a binary number as output. In Fig. 7-1(a) we
see an example of a sine wave. To represent this signal digitally, we can sample it every T seconds, as shown by the bar heights.
If a sound wave is not a pure sine wave but a linear superposition of sine waves in which the highest frequency component present is f, then the Nyquist theorem states that it is sufficient to sample at a frequency of 2f. Sampling more often is of no value, since the higher frequencies that such sampling could detect are not present.
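The Nyquist limit can be illustrated numerically. The sketch below (illustrative, not from the source) computes the apparent frequency of a sampled tone: any component above half the sampling rate folds back ("aliases") to a lower frequency.

```python
def alias_frequency(f_signal, f_sample):
    """Apparent frequency after sampling: anything above f_sample/2 folds back."""
    f = f_signal % f_sample
    return min(f, f_sample - f)

# A 3 kHz tone sampled at 8 kHz (above the 6 kHz Nyquist rate) is captured faithfully.
print(alias_frequency(3000, 8000))   # 3000
# A 6 kHz tone sampled at 8 kHz is under-sampled and appears as a 2 kHz tone.
print(alias_frequency(6000, 8000))   # 2000
```

This is why the telephone system, sampling at 8000 samples/sec, loses all frequencies above 4 kHz.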
2 BSS ECE, REVA
Digital samples are never exact. The samples of Fig. 7-1(c) allow only nine values, from −1.00 to +1.00 in steps of 0.25. An 8-bit sample would allow 256 distinct values; a 16-bit sample would allow 65,536 distinct values. The error introduced by the finite number of bits per sample is called quantization noise. If it is too large, the ear detects it. Two well-known examples where sampled sound is used are the telephone and audio compact discs. Pulse code modulation, as used within the telephone system, uses 8-bit samples made 8000 times per second. In North America and Japan, 7 bits are for data and 1 is for control; in Europe, all 8 bits are for data. This system gives a data rate of 56,000 bps or 64,000 bps, respectively. With only 8000 samples/sec, frequencies above 4 kHz are lost.
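The effect of sample width on quantization noise can be sketched as follows (an illustration; the `quantize` helper is our own, not from the source). The error is bounded by half the step between adjacent levels, so each extra bit roughly halves the worst-case noise.

```python
def quantize(x, bits):
    """Round x (in the range -1.0 .. +1.0) to the nearest of 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)
    return round((x + 1.0) / step) * step - 1.0

# Worst-case quantization error is half a step: ~1/255 at 8 bits, ~1/65535 at 16 bits.
assert abs(quantize(0.3, 8) - 0.3) <= 1.0 / 255
assert abs(quantize(0.3, 16) - 0.3) <= 1.0 / 65535
```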
Audio CDs are digital with a sampling rate of 44,100 samples/sec, enough to capture frequencies up to 22,050 Hz, which is good enough for people but bad for canine music lovers. The samples are 16 bits each and are linear over the range of amplitudes. Note that 16-bit samples allow only 65,536 distinct values, even though the dynamic range of the ear is about 1 million when measured in steps of the smallest audible sound. Thus, using only 16 bits per sample introduces some quantization noise (although the full dynamic range is not covered: CDs are not supposed to hurt). With 44,100 samples/sec of 16 bits each, an audio CD needs a bandwidth of 705.6 kbps for monaural and 1.411 Mbps for stereo. While this is lower than what video needs (see below), it still takes almost a full T1 channel to transmit uncompressed CD-quality stereo sound in real time.
Digitized sound can be easily processed by computers in software. Dozens of programs exist for personal computers that allow users to record, display, edit, mix, and store sound waves from multiple sources. Virtually all professional sound recording and editing are digital nowadays.
Music, of course, is just a special case of general audio, but an important one. Another important special case is speech. Human speech tends to be in the 600-Hz to 6000-Hz range. Speech is made up of vowels and consonants, which have different properties. Vowels are produced when the vocal tract is unobstructed, producing resonances whose fundamental frequency depends on
the size and shape of the vocal system and the position of the speaker's tongue and jaw. These sounds are almost periodic for intervals of about 30 msec. Consonants are produced when the vocal tract is partially blocked. These sounds are less regular than vowels.
Some speech generation and transmission systems make use of models of the vocal system to reduce
speech to a few parameters (e.g., the sizes and shapes of various cavities), rather than just sampling the speech waveform.
The DPCM signal is computed by subtracting the current contents of the register (R0) from the new PCM output of the ADC.
The register value is then updated before transmission.
Decoder
The decoder simply adds the previous register contents (the last PCM value) to the received DPCM signal. Since the ADC output contains noise, cumulative errors can build up in the value of the register signal.
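The encoder/decoder pair described above can be sketched in a few lines (illustrative, using integer samples and a single previous-value register):

```python
def dpcm_encode(samples):
    """Transmit only the difference between each sample and the previous one."""
    prev = 0
    diffs = []
    for s in samples:
        diffs.append(s - prev)   # difference signal (DPCM)
        prev = s                 # update the register before the next sample
    return diffs

def dpcm_decode(diffs):
    """Rebuild each sample by adding the received difference to the register."""
    prev = 0
    out = []
    for d in diffs:
        prev += d
        out.append(prev)
    return out

pcm = [3, 5, 6, 6, 4]
assert dpcm_decode(dpcm_encode(pcm)) == pcm
```

Note that the decoder's register tracks the encoder's exactly only because the differences here are transmitted without further quantization; in a real DPCM codec the difference itself is quantized, which is where the cumulative error mentioned above comes from.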
To reduce this noise effect, predictive methods are used to form a more accurate estimate of the previous signal: not only the current signal but also varying proportions of a number of the preceding estimated signals are used. These proportions are known as predictor coefficients. The difference signal is then computed by subtracting varying proportions of the last three predicted values from the current output of the ADC; that is, R1, R2, and R3 are subtracted from the PCM value.
The value in register R1 is then transferred to R2, and R2 to R3, and the new predicted value goes into R1.
The decoder operates in a similar way, adding the same proportions of the last three computed PCM signals to the received DPCM signal.
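A third-order predictor of the kind described can be sketched as follows; the coefficient values (0.5, 0.3, 0.2) are purely illustrative, not taken from any standard:

```python
def predictive_dpcm_encode(samples, coeffs=(0.5, 0.3, 0.2)):
    """Difference between each sample and a weighted sum of the last three
    reconstructed values held in the predictor registers R1, R2, R3."""
    r = [0.0, 0.0, 0.0]          # R1 (most recent), R2, R3
    diffs = []
    for s in samples:
        pred = sum(c * v for c, v in zip(coeffs, r))
        d = s - pred
        diffs.append(d)
        r = [pred + d, r[0], r[1]]   # shift R1->R2->R3, new value into R1
    return diffs

def predictive_dpcm_decode(diffs, coeffs=(0.5, 0.3, 0.2)):
    """Add the same proportions of the last three computed values to each
    received difference, mirroring the encoder's registers."""
    r = [0.0, 0.0, 0.0]
    out = []
    for d in diffs:
        s = sum(c * v for c, v in zip(coeffs, r)) + d
        out.append(s)
        r = [s, r[0], r[1]]
    return out

pcm = [3.0, 5.0, 6.0, 6.0, 4.0]
decoded = predictive_dpcm_decode(predictive_dpcm_encode(pcm))
assert all(abs(a - b) < 1e-9 for a, b in zip(decoded, pcm))
```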
A second ADPCM standard, a derivative of G.721, is defined in ITU-T Recommendation G.722. Here the input is passed through two filters, one of which passes only signal frequencies in the range 50 Hz through to 3.5 kHz and the other only frequencies in the range 3.5 kHz through to 7 kHz.
By doing this, the input signal is effectively divided into two separate equal-bandwidth signals, the first known as the lower subband signal and the second the upper subband signal.
Each is then sampled and encoded independently using ADPCM, the sampling rate of the upper subband signal being 16 ksps to allow for the presence of the higher-frequency components in this subband.
The use of two subbands has the advantage that a different bit rate can be used for each. In general, the frequency components in the lower subband have a higher perceptual importance than those in the upper subband, so the lower subband is encoded at a higher bit rate (48 kbps) than the upper subband (16 kbps).
The two bit streams are then multiplexed together to produce the transmitted 64 kbps signal, in such a way that the decoder in the receiver is able to divide them back into two separate streams for decoding.
In linear predictive coding (LPC), the perceptual parameters of a speech segment are extracted and transmitted, and the receiver uses a model of the vocal tract to regenerate a sound that is perceptually comparable with the source audio signal. With this compression technique, although the speech can often sound synthetic, high levels of compression can be achieved. In terms of speech, the three features which determine the perception of a signal by the ear are its:
Pitch: closely related to the frequency of the signal. This is important since the ear is more sensitive to signals in the range 2-5 kHz.
Period: the duration of the signal.
Loudness: determined by the amount of energy in the signal.
The input speech waveform is first sampled and quantized at a defined rate.
A block of digitized samples, known as a segment, is then analysed to determine the various perceptual parameters of the speech, together with an indication of whether the segment is voiced (generated through vibration of the vocal cords) or unvoiced (the vocal cords are open).
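One common way to estimate the pitch of a voiced segment, sketched below, is to pick the lag with the highest autocorrelation. This is an illustrative simplification of the analysis an LPC encoder performs, not the exact algorithm from the text.

```python
import math

def pitch_period(segment):
    """Estimate the pitch period (in samples) of a voiced segment as the lag
    with the highest autocorrelation."""
    n = len(segment)
    best_lag, best_corr = 0, 0.0
    for lag in range(1, n // 2):
        corr = sum(segment[i] * segment[i - lag] for i in range(lag, n))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return best_lag

# A synthetic periodic "voiced" segment with a period of 8 samples.
seg = [math.sin(2 * math.pi * i / 8) for i in range(64)]
assert pitch_period(seg) == 8
```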
5 PERCEPTUAL CODING:
LPC and CELP are used for telephony applications and hence for compression of speech signals. Perceptual coding, in contrast, is designed for compression of general audio, such as that associated with a digital television broadcast.
Using this approach, sampled segments of the source audio waveform are analysed, but only those features that are perceptible to the ear are transmitted. For example, the sensitivity of the ear to each signal is non-linear; that is, the ear is more sensitive to some signals than others.
Also, when multiple signals are present, as in general audio, a strong signal may reduce the level of sensitivity of the ear to other signals which are near to it in frequency, an effect known as frequency masking.
Similarly, when the ear hears a loud sound, it takes a short but finite time before it can hear a quieter sound, an effect known as temporal masking.
For example, if a quieter signal A is close in frequency to a louder signal B and falls within the masking region of signal B, signal A will no longer be heard, as it is within the distortion band.
The width of each masking curve at a particular signal level is known as the critical bandwidth for that frequency. It has been observed that for frequencies less than 500 Hz the critical bandwidth is around 100 Hz; for frequencies greater than 500 Hz, however, the bandwidth increases linearly in multiples of 100 Hz.
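The rule of thumb just quoted can be expressed as a small helper (an illustration of the stated rule only, not a standard psychoacoustic formula):

```python
def critical_bandwidth(f_hz):
    """Critical bandwidth per the rule above: ~100 Hz below 500 Hz,
    then increasing linearly in multiples of 100 Hz."""
    if f_hz < 500:
        return 100.0
    return 100.0 * (f_hz / 500.0)

print(critical_bandwidth(300))    # 100.0
print(critical_bandwidth(1000))   # 200.0
```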
Hence, if the magnitude of the frequency components that make up an audio sound can be determined, it becomes possible to determine those frequencies that will be masked (frequency masking) and therefore do not need to be transmitted.
After the ear hears a loud sound, it takes a further short time before it can hear a quieter sound; this is known as temporal masking. After the loud sound ceases, it takes a short period of time for the signal amplitude to decay. During this time, signals whose amplitudes are less than the decay envelope will not be heard and hence need not be transmitted.
5.1 MPEG Audio Coders:
In the MPEG audio coders, the sampled input signal is first passed through a bank of analysis filters. The bank of filters maps each set of 32 (time-related) PCM samples into an equivalent set of 32 frequency samples, one per subband.
Processing associated with both frequency and temporal masking is carried out by the psychoacoustic
model.
In the basic encoder, the time duration of each sampled segment of the audio input signal is equal to the time required to accumulate 12 successive sets of 32 PCM samples. The outputs of the psychoacoustic model indicate the frequency components whose amplitude is below the audible threshold. This is done so that more bits are available for the highest-sensitivity regions compared with the less sensitive regions.
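The principle of spending bits only on audible components can be sketched as follows. The proportional allocation rule and the threshold value here are illustrative, not taken from the MPEG standard:

```python
def allocate_bits(amplitudes, threshold, total_bits=32):
    """Give bits only to subbands whose amplitude is above the audible
    threshold, in proportion to how far above it they are (illustrative)."""
    audible = [max(a - threshold, 0) for a in amplitudes]
    total = sum(audible)
    if total == 0:
        return [0] * len(amplitudes)
    return [round(total_bits * a / total) for a in audible]

# Subbands below the threshold (here 10) get no bits at all.
bits = allocate_bits([30, 5, 20, 8], threshold=10)
assert bits[1] == 0 and bits[3] == 0   # masked/inaudible subbands
assert bits[0] > bits[2]               # strongest subband gets the most bits
```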
The maximum amplitude of the 12 samples in each subband, known as the scale factor, is used to quantize the 12 frequency components in the subband relative to this level.
Collectively this is known as the subband sample (SBS) format. The ancillary data field at the end of the frame is optional and is used, for example, to carry additional coded samples associated with the surround-sound present in some digital video broadcasts.
At the decoder, the dequantizers determine the magnitude of each signal, and the synthesis filters produce the output PCM samples.
6. VIDEO COMPRESSION:
One approach to compressing a video source is to apply the JPEG algorithm to each frame independently. In practice, however, consecutive frames differ very little: at, say, 60 frames/s, a 3-second scene is composed of 180 largely similar frames, so by sending only those segments of each frame that have movement associated with them, considerable additional savings in bandwidth can be made. There are two basic types of compressed frames: those that are compressed independently (I-frames) and those that are predicted from other frames.
A compressed video sequence is referred to as moving pictures, and the terms frame and picture are used interchangeably.
I-frames (Intracoded frames) are encoded without reference to any other frames. Each frame is treated
as a separate picture and the Y, Cr and Cb matrices are encoded separately using JPEG.
With I-frames the compression level is relatively small. They are good for the first frame relating to a new scene in a movie. I-frames must also be repeated at regular intervals, since during transmission an error may corrupt a frame, and without a fresh I-frame the whole picture from that point on would be lost.
P-frames are encoded using a combination of motion estimation and motion compensation. The accuracy of the prediction operation is determined by how well any movement between successive frames is estimated; the differences between the predicted and actual positions of the moving segments involved are then encoded, which is known as motion compensation.
The number of P-frames between successive I-frames is limited to avoid error propagation.
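The constraint that only a limited number of predicted frames separate successive I-frames leads to a repeating frame pattern. The sketch below generates one such pattern (including the bidirectional B-frames discussed later); the interval values are illustrative, not mandated by any standard:

```python
def gop_pattern(n_frames, i_interval=12, b_between=2):
    """Illustrative group-of-pictures pattern: an I-frame every i_interval
    frames, P-frames in between, with b_between B-frames before each P."""
    types = []
    for n in range(n_frames):
        if n % i_interval == 0:
            types.append('I')
        elif (n % i_interval) % (b_between + 1) == 0:
            types.append('P')
        else:
            types.append('B')
    return ''.join(types)

print(gop_pattern(12))  # IBBPBBPBBPBB
```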
These are encoded independently using the JPEG algorithm (DCT, quantization, entropy encoding), except that the quantization threshold values used are the same for all DCT coefficients.
PB-Frames
A fourth type of frame, known as a PB-frame, has also been defined; it does not refer to a new frame type as such but rather to the way two neighbouring P- and B-frames are encoded as if they were a single frame.
Motion Estimation and Compensation:
Motion estimation involves comparing small segments of two consecutive frames for differences; should a difference be detected, a search is carried out to determine to which neighbouring segment the original segment has moved.
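The segment comparison and neighbourhood search just described can be sketched as an exhaustive block-matching routine (illustrative only: real encoders use 16 × 16 macroblocks and much larger, cleverer search windows):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
                          for a, b in zip(row_a, row_b))

def best_motion_vector(target, reference, by, bx, size, search=1):
    """Search the neighbourhood of (by, bx) in the reference frame for the
    block that best matches the target block at (by, bx)."""
    tgt = [row[bx:bx + size] for row in target[by:by + size]]
    best, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + size > len(reference) or x + size > len(reference[0]):
                continue
            cand = [row[x:x + size] for row in reference[y:y + size]]
            cost = sad(tgt, cand)
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, (dy, dx)
    return best, best_cost

# A 2x2 bright block that moved one pixel to the right between frames,
# so its best match in the reference frame lies one pixel to the left.
ref = [[0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
tgt = [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0], [0, 0, 0, 0]]
vec, cost = best_motion_vector(tgt, ref, 0, 2, 2)
assert vec == (0, -1) and cost == 0
```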
To limit the time taken for the search, the comparison is limited to a few segments. This works well in slow-moving applications such as video telephony; for fast-moving video it is not effective. Hence B-frames (bidirectional frames) are used: their contents are predicted using both past and future frames.
B-frames provide the highest level of compression and, because they are not involved in the coding of other frames, they do not propagate errors.
P-frame encoding
The digitized contents of the Y matrix associated with each frame are first divided into a two-dimensional array of 16 × 16 pixel segments known as macroblocks; for the two chrominance matrices, smaller 8 × 8 blocks are used.
To encode a P-frame, the contents of each macroblock in the frame (the target frame) are compared on a pixel-by-pixel basis with the contents of the corresponding macroblock in the preceding I- or P-frame (the reference frame).
If a close match is found, then only the address of the macroblock is encoded. If a match is not found, the search is extended to cover an area around the macroblock in the reference frame.
B-frame encoding
To encode a B-frame, any motion is estimated with reference to both the immediately preceding I- or P-frame and the immediately succeeding P- or I-frame.
The encoded form of each macroblock depends on the output of the motion estimation unit which, in turn, depends on the contents of the macroblock being encoded and the contents of the macroblock in the search area of the reference frame that produces the closest match. There are three possibilities:
- If the two contents are the same, only the address of the macroblock in the reference frame is encoded.
- If the two contents are very close, both the motion vector and the difference matrices associated with the macroblock in the reference frame are encoded.
- If no close match is found, the target macroblock is encoded in the same way as a macroblock in an I-frame.
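The three-way decision can be sketched as a small selector; the cost thresholds are hypothetical tuning values, not figures from the text:

```python
def encode_macroblock(match_cost, exact_thresh=0, close_thresh=64):
    """Choose one of the three encodings above based on the match cost
    (e.g. sum of absolute differences) reported by motion estimation."""
    if match_cost <= exact_thresh:
        return 'address-only'          # contents identical: address suffices
    if match_cost <= close_thresh:
        return 'vector+difference'     # close match: vector plus differences
    return 'intra'                     # no close match: encode like an I-frame

assert encode_macroblock(0) == 'address-only'
assert encode_macroblock(10) == 'vector+difference'
assert encode_macroblock(500) == 'intra'
```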
Each encoded macroblock leaves the formatter with the following fields:
Type: indicates the type of frame encoded, I, P or B.
Address: identifies the location of the macroblock in the frame.
Quantization value: the value used to quantize all the DCT coefficients in the macroblock.
Motion vector: the encoded vector.
Block representation: indicates which of the six 8 × 8 blocks that make up the macroblock are present.
B1, B2, ..., B6: JPEG-encoded DCT coefficients for those blocks present.
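The fields above can be gathered into an illustrative record type (the field names are our own, not taken from the standard):

```python
from dataclasses import dataclass

@dataclass
class MacroBlock:
    """Illustrative container for the encoded macroblock fields listed above."""
    frame_type: str          # 'I', 'P' or 'B'
    address: int             # location of the macroblock in the frame
    quantization: int        # quantizer used for all DCT coefficients
    motion_vector: tuple     # encoded (dy, dx) vector
    block_present: list      # which of the six 8x8 blocks are present
    blocks: list             # JPEG-encoded DCT coefficients for those blocks

mb = MacroBlock('P', 42, 8, (0, -1), [True] * 4 + [False] * 2, [[], [], [], []])
assert sum(mb.block_present) == len(mb.blocks)   # one coefficient list per block
```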
MPEG-1 uses a similar video compression technique to H.261; the digitization format used is the source intermediate format (SIF), with progressive scanning and a refresh rate of 30 Hz (for NTSC) or 25 Hz (for PAL).
COMPRESSION:
Compression for I-frames is similar to JPEG; for video, ratios of typically 10:1 through to 20:1 are obtained, depending on the complexity of the frame contents.
MPEG-2: Used in the recording and transmission of studio-quality audio and video. Different levels of video resolution are possible:
Low: 352 × 288 pixels, comparable with MPEG-1.
Main: 720 × 576 pixels, studio-quality video and audio, bit rate up to 15 Mbps.
High: 1920 × 1152 pixels, used in wide-screen HDTV; bit rates of up to 80 Mbps are possible.
MPEG-4: Used for interactive multimedia applications over the Internet and over various entertainment
networks.
The MPEG-4 standard contains features that enable a user not only to passively access a video sequence (using, for example, start/stop operations) but also to manipulate the individual elements that make up a scene within a video.
In MPEG-4, each video frame is segmented into a number of video object planes (VOPs), each of which can be manipulated by the viewer prior to it being decoded and played out, provided the creator of the audio and/or video has provided that facility.
The choice of audio coder, and hence the compression scheme used, depends on the available bit rate of the transmission channel and the sound quality required.