
CSC 311

CHAPTER FIVE

DATA COMPRESSION

For many of the applications and uses we make of modern computers, data compression is absolutely essential.


Fax
MP3
Video
TV
etc.

For example, a typical fax uses 40,000 dots per square inch; sending an uncompressed page over a 56K modem would require more than one minute per page.

A typical 2-hour movie would require 1.04 * 10^12 bits, far beyond the capacity of any DVD, yet you can put two 2-hour movies on a DVD.

This is made possible by the use of data compression.

There are fundamentally two types of data compression:

Lossless

Lossy

Lossless:

Lossless compression techniques allow the
receiver to precisely reconstruct the original
data being transmitted.


Lossy:

Lossy compression techniques allow the receiver
to approximately reconstruct the original data.

Frequency Dependent Codes:

We first want to examine two compression techniques that build their codes from the frequency of occurrence of the various symbols:

Huffman codes

Arithmetic compression

Huffman Codes:

Huffman codes rely on the frequency of use of the
various symbols to produce codes of varying length
to represent the symbols.

Huffman codes have the key property that:

No valid Huffman code for any symbol is a prefix
of the code for any other symbol.

This is sometimes called the no-prefix property.

Example:

Letter  Frequency  Huffman Code

A       25         01
B       15         110
C       10         111
D       20         10
E       30         00

Note: Huffman codes are not unique, but a properly formed
Huffman code will always be optimal among codes that assign each
symbol its own whole-bit code word.
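As a rough illustration of how such a code can be built from the frequency table above, here is a minimal Python sketch. The function name `huffman_codes` and the tie-breaking counter are assumptions made for this example and are not part of the slides. Because Huffman codes are not unique, the bit patterns it produces may differ from the table, but the code lengths should agree.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a {symbol: frequency} mapping.

    Sketch only: assumes at least two symbols.
    """
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f0, _, low = heapq.heappop(heap)
        f1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


freqs = {"A": 25, "B": 15, "C": 10, "D": 20, "E": 30}
codes = huffman_codes(freqs)
avg_bits = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes, avg_bits)
```

For these frequencies A, D and E receive 2-bit codes and B and C receive 3-bit codes, for an average of 2.25 bits per symbol, compared with 3 bits per symbol for a fixed-length code over five symbols.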


Arithmetic Compression:

Another frequency dependent compression technique.

Based on representing a character string as a single
real number.

Assigning ranges based on frequency:

Letter  Frequency %  Subinterval [p,q]
A       25           [0, 0.25]
B       15           [0.25, 0.40]
C       10           [0.40, 0.50]
D       20           [0.50, 0.70]
E       30           [0.70, 1.0]

How does it work?
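We calculate the new interval [x, y] from the old interval and the subinterval [p, q] of the current symbol. With w = y - x the width of the old interval, the new endpoints are x + w*p and x + w*q.

For example, if the old interval is [0.3, 0.9] (w = 0.6) and the current symbol has the subinterval [0.25, 0.5], the interval changes from [0.3, 0.9] to [0.45, 0.60].

The arithmetic for encoding the string CABAC is worked through step by step: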
Step  String  Next char  Current [x,y]         [p,q]         Width w = y-x  New x = x + w*p                      New y = x + w*q

1     -       C          [0, 1]                [0.4, 0.5]    1.0            0 + 1.0*0.4 = 0.4                    0 + 1.0*0.5 = 0.5

2     C       A          [0.4, 0.5]            [0, 0.25]     0.1            0.4 + 0.1*0 = 0.4                    0.4 + 0.1*0.25 = 0.425

3     CA      B          [0.4, 0.425]          [0.25, 0.40]  0.025          0.4 + 0.025*0.25 = 0.40625           0.4 + 0.025*0.4 = 0.41

4     CAB     A          [0.40625, 0.41]       [0, 0.25]     0.00375        0.40625 + 0.00375*0 = 0.40625        0.40625 + 0.00375*0.25 = 0.4071875

5     CABA    C          [0.40625, 0.4071875]  [0.4, 0.5]    0.0009375      0.40625 + 0.0009375*0.4 = 0.406625   0.40625 + 0.0009375*0.5 = 0.40671875



We could choose any number in the interval [0.406625, 0.40671875] to represent
the string CABAC.
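The interval-narrowing steps in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `RANGES` and `arithmetic_encode` are invented for this example, and exact fractions are used instead of floating point so the endpoints come out exactly as in the table.

```python
from fractions import Fraction

# Subintervals [p, q] taken from the frequency table.
RANGES = {
    "A": (Fraction("0"), Fraction("0.25")),
    "B": (Fraction("0.25"), Fraction("0.40")),
    "C": (Fraction("0.40"), Fraction("0.50")),
    "D": (Fraction("0.50"), Fraction("0.70")),
    "E": (Fraction("0.70"), Fraction("1")),
}


def arithmetic_encode(text, ranges=RANGES):
    """Return the final interval (x, y) for the given string."""
    x, y = Fraction(0), Fraction(1)
    for ch in text:
        p, q = ranges[ch]
        w = y - x                      # width of the current interval
        x, y = x + w * p, x + w * q    # new x and new y, as in the table
    return x, y


low, high = arithmetic_encode("CABAC")
print(float(low), float(high))  # 0.406625 0.40671875
```

Any number between `low` and `high` identifies the string, which is why a value such as 0.4067 can be sent.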

Suppose we send N = 0.4067. The receiver only knows the number we sent and the
contents of the original table of symbols and their probabilities.

How do we produce the original string?

At each step, find the subinterval [p,q] that contains N, output its character, then replace N with (N - p) / (q - p).

Step  N       Interval [p,q]  Width  Char  N - p   (N - p) / width

1     0.4067  [0.4, 0.5]      0.10   C     0.0067  0.067

2     0.067   [0, 0.25]       0.25   A     0.067   0.268

3     0.268   [0.25, 0.40]    0.15   B     0.018   0.12

4     0.12    [0, 0.25]       0.25   A     0.12    0.48

5     0.48    [0.4, 0.5]      0.10   C     0.08    0.8
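The decoding steps above can be sketched as follows. This is a hedged illustration, not a production decoder: the names `RANGES` and `arithmetic_decode` are assumptions for this example, the number of characters is passed in explicitly rather than signalled inside the message, and each subinterval is treated as half-open (p <= N < q) so that a value on a shared boundary belongs to exactly one symbol.

```python
from fractions import Fraction

# Subintervals [p, q] taken from the frequency table.
RANGES = {
    "A": (Fraction("0"), Fraction("0.25")),
    "B": (Fraction("0.25"), Fraction("0.40")),
    "C": (Fraction("0.40"), Fraction("0.50")),
    "D": (Fraction("0.50"), Fraction("0.70")),
    "E": (Fraction("0.70"), Fraction("1")),
}


def arithmetic_decode(n, length, ranges=RANGES):
    """Recover `length` characters from the number n."""
    out = []
    for _ in range(length):
        for ch, (p, q) in ranges.items():
            if p <= n < q:                 # subinterval containing n
                out.append(ch)
                n = (n - p) / (q - p)      # subtract p, divide by the width
                break
        else:
            raise ValueError("n does not fall in any subinterval")
    return "".join(out)


print(arithmetic_decode(Fraction("0.4067"), 5))  # CABAC
```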





How do we know when to stop? Obviously we could continue the decoding process begun
above, but there are no more characters actually encoded in the message.

It is customary to include a terminating character in the code; when you decode the
terminating character, you stop.

The number of characters that can be encoded is limited by the precision of real number
representation on your machine.

Run Length Encoding:

Look for long runs of ones or zeros.

Example: runs of the same bit. Here we are looking for long
runs of 0, which one may commonly find in fax transmissions, for
example.


Each run length is sent as a 4-bit code, so a single code can specify at most 15 zeros.

How, then, can we specify a run of more than 15 zeros?


To send, for example, 20 zeros we would send:

1111 0101

When the receiver sees 1111, it assumes that the next code is a
continuation of the same run (15 + 5 = 20).

How then might we send a code for 30 zeros?

1111 1111 0000

The code for 0 zeros is needed to terminate
the 1111 code (15 + 15 + 0 = 30).
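The continuation rule can be sketched in Python as below. The function names `encode_zero_run` and `decode_zero_run` are assumptions made for this example; the sketch handles only the length of a single run of zeros, not a complete bit stream.

```python
def encode_zero_run(n, bits=4):
    """Return the fixed-width codes for a run of n zeros."""
    max_count = (1 << bits) - 1            # 15 for 4-bit codes
    codes = []
    while n >= max_count:
        # An all-ones code means "15 zeros, and the run continues".
        codes.append(format(max_count, f"0{bits}b"))
        n -= max_count
    codes.append(format(n, f"0{bits}b"))   # final code, possibly 0000
    return codes


def decode_zero_run(codes):
    """Add up the codes of one run, stopping at the first code below the maximum."""
    max_count = (1 << len(codes[0])) - 1
    total = 0
    for code in codes:
        value = int(code, 2)
        total += value
        if value < max_count:
            break
    return total


print(encode_zero_run(20))  # ['1111', '0101']
print(encode_zero_run(30))  # ['1111', '1111', '0000']
```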
Lempel-Ziv Compression

LZ is a compression technique that realizes compression ratios of up to 20 to 1.

It relies on the fact that, in any document, character strings are going to be repeated.

For example, in legal documents such as contracts, one is likely to find phrases such as "whereas the party of the first part" repeated many times in the document. Would it not be nice if, rather than sending the thirty-five individual characters contained in the above phrase, we could simply send a single integer, such as 18, and have the receiver understand that 18 stands for the above phrase?

Lempel-Ziv provides an elegant algorithm for accomplishing this.

The sender has the original message and a previously agreed-upon symbol table, usually the set of allowable characters in the alphabet.


The receiving party knows nothing of the message content, but it knows the contents and organization of the symbol table.

Let us suppose, at the sender's end, we wish to send a message that begins:

ABABABCBABABA

The sender would have the following symbol table, assuming that all possible messages consist only of patterns of the characters A, B, and C.

Beginning symbol table:
0 A
1 B
2 C

The receiver, knowing that all messages are composed only of the characters A, B, and C, would have an identical symbol table at the beginning:

0 A
1 B
2 C

At the sending end, the goal is to build an expanded symbol table containing all of the character patterns encountered so far.

One pass through the algorithm processes one new character of the message. On each pass the sender tracks the following information:

Pass  Buffer content  Current char  What is sent    What is stored in table  New buffer content

1     A               B             0 (code for A)  AB (code = 3)            B

The algorithm begins by placing the first character, A, in the buffer; the first pass through the loop begins by reading the second character, B.

The sender's symbol table would now look as follows:
0 A
1 B
2 C
3 AB

At the other end of the transmission, the receiver is trying to reconstruct the symbol table that
the sender is building. The receiver is gathering the following information:

Pass  Prior (string)  Current (string)  Is current code in table?  C (1st char)  Tempstring/code pair  What is printed (curr or temp?)
1     0 (A)           1 (B)             Yes                        B             AB/3                  B (current)

Since the receiver has received the codes for both A and B sequentially, it knows the sender
has seen the character pattern AB and stores this as entry 3 in its table.

Receiver's table after pass one:
0 A
1 B
2 C
3 AB

This process continues for the entire message. The first 13 characters, ABABABCBABABA, are traced here.

Sender

Pass  Buffer content  Current char  What is sent      What is stored in table  New buffer content
1     A               B             0 (code for A)    AB (code = 3)            B
2     B               A             1 (code for B)    BA (code = 4)            A
3     A               B             -                 -                        AB
4     AB              A             3 (code for AB)   ABA (code = 5)           A
5     A               B             -                 -                        AB
6     AB              C             3 (code for AB)   ABC (code = 6)           C
7     C               B             2 (code for C)    CB (code = 7)            B
8     B               A             -                 -                        BA
9     BA              B             4 (code for BA)   BAB (code = 8)           B
10    B               A             -                 -                        BA
11    BA              B             -                 -                        BAB
12    BAB             A             8 (code for BAB)  BABA (code = 9)          A
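The sender's loop can be sketched in Python as below. This is an illustrative sketch of the table-building scheme traced above (commonly known as the LZW variant of Lempel-Ziv). The function name `lzw_compress` is an assumption, as is the final step that sends the code for whatever remains in the buffer when the message ends; the slides do not show that step. The sketch also assumes a non-empty message.

```python
def lzw_compress(message, alphabet="ABC"):
    """Return (codes sent, final symbol table) for a non-empty message."""
    table = {ch: i for i, ch in enumerate(alphabet)}   # 0 A, 1 B, 2 C
    buffer = message[0]
    sent = []
    for ch in message[1:]:
        if buffer + ch in table:
            buffer += ch                       # pattern known: keep growing it
        else:
            sent.append(table[buffer])         # send the code for the buffer
            table[buffer + ch] = len(table)    # store the new pattern
            buffer = ch
    sent.append(table[buffer])                 # assumption: flush the last buffer
    return sent, table


codes, table = lzw_compress("ABABABCBABABA")
print(codes)  # [0, 1, 3, 3, 2, 4, 8, 0]
```

The first seven codes match the "What is sent" column; the trailing 0 is the flushed buffer (A) left over after pass 12.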

Receiver

Pass  Prior (string)  Current (string)  Is current code in table?  C (1st char)  Tempstring/code pair  What is printed (curr or temp?)
1     0 (A)           1 (B)             Yes                        B             AB/3                  B (current)
2     1 (B)           3 (AB)            Yes                        A             BA/4                  AB (current)
3     3 (AB)          3 (AB)            Yes                        A             ABA/5                 AB (current)
4     3 (AB)          2 (C)             Yes                        C             ABC/6                 C (current)
5     2 (C)           4 (BA)            Yes                        B             CB/7                  BA (current)
6     4 (BA)          8                 No                         B             BAB/8                 BAB (temp)
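The receiver's side of the trace can be sketched in the same way. The function name `lzw_decompress` is an assumption for this example. When a received code is not yet in the table (pass 6), the sketch builds the temporary string from the prior string plus its own first character, which is what the table does when it forms BAB from BA.

```python
def lzw_decompress(codes, alphabet="ABC"):
    """Return (decoded text, rebuilt symbol table) for a list of codes."""
    table = list(alphabet)              # index = code: 0 A, 1 B, 2 C
    prior = table[codes[0]]
    output = [prior]                    # the first code is printed directly
    for code in codes[1:]:
        if code < len(table):
            current = table[code]       # code already in the table
        else:
            current = prior + prior[0]  # temp string for a not-yet-stored code
        table.append(prior + current[0])  # same entry the sender stored
        output.append(current)
        prior = current
    return "".join(output), table


text, table = lzw_decompress([0, 1, 3, 3, 2, 4, 8])
print(text)       # ABABABCBABAB
print(table[3:])  # ['AB', 'BA', 'ABA', 'ABC', 'CB', 'BAB']
```

After these seven codes the receiver has rebuilt entries 3 through 8 but not yet entry 9, because it learns each new pattern one step later than the sender.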

At this point the sender and receiver symbol tables would contain:

Code  Sender  Receiver
0     A       A
1     B       B
2     C       C
3     AB      AB
4     BA      BA
5     ABA     ABA
6     ABC     ABC
7     CB      CB
8     BAB     BAB
9     BABA    not yet

Image compression is an example of lossy compression.

At just 640 x 480 resolution, a color image at 24 bits per pixel would require 7,372,800 bits. For motion, we send 30 images per second, which would require over 220 million bits per second for a single video stream.

Lossy compression schemes are used to dramatically reduce this requirement.
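The figures above can be checked with a few lines of arithmetic. The 24 bits per pixel is an assumption inferred from the numbers on the slide (7,372,800 bits divided by 640 x 480 pixels); the constant names are invented for this example.

```python
WIDTH, HEIGHT = 640, 480
BITS_PER_PIXEL = 24        # assumption: 8 bits each for red, green and blue
FRAMES_PER_SECOND = 30

bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
bits_per_second = bits_per_frame * FRAMES_PER_SECOND

print(bits_per_frame)    # 7372800
print(bits_per_second)   # 221184000
```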


We won't go through the details of how video compression is accomplished, but I suggest you read the remainder of the chapter for your own enlightenment.

The frames that are transmitted are of three different types:

P frame: encoded by computing the differences between the current
frame and the previous frame

B frame: similar to a P frame, except that it is interpolated between
a previous and a future frame

I frame: just a JPEG-encoded image
