The word data is in general used to mean the information in digital form on which computer
programs operate, and compression means a process of removing redundancy in the data. By
'compressing data', we actually mean deriving techniques or, more specifically, designing
efficient algorithms to:
represent data in a less redundant fashion
remove the redundancy in data
Implement compression algorithms, including both compression and decompression.
Data Compression means encoding the information in a file in such a way that it takes less space.
Compression is used just about everywhere. All the images you get on the web are compressed,
typically in the JPEG or GIF formats, most modems use compression, HDTV will be compressed
using MPEG-2, and several file systems automatically compress files when stored, and the rest
of us do it by hand. The task of compression consists of two components, an encoding algorithm
that takes a message and generates a compressed representation (hopefully with fewer bits),
and a decoding algorithm that reconstructs the original message or some approximation of it
from the compressed representation.
Compression denotes compact representation of data.
Examples of the kinds of data we typically want to compress are:
text
source-code
arbitrary files
images
video
audio data
speech
Correlation: adjacent data samples tend to be equal or similar (e.g. think of images or
video data). There are different types of correlation:
Spatial correlation
Spectral correlation
Temporal correlation
In addition, in many data types there is a significant amount of irrelevancy since the human brain
is not able to process and/or perceive the entire amount of data. As a consequence, such data can
be omitted without degrading perception. Furthermore, some data contain more abstract
properties which are independent of time, location, and resolution and can be described very
efficiently (e.g. fractal properties).
Compression techniques are broadly classified into two categories:
Lossless Compression
A compression approach is lossless only if it is possible to exactly reconstruct the original data
from the compressed version. There is no loss of any information during the compression
process.
For example, in Figure below, the input string AABBBA is reconstructed after the execution of
the compression algorithm followed by the decompression algorithm.
Lossless compression is called reversible compression since the original data may be recovered
perfectly by decompression.
Lossless compression techniques are used when the original data of a source are so important
that we cannot afford to lose any details. Examples of such source data are medical images, text
and images preserved for legal reason, some computer executable files, etc.
In lossless compression (as the name suggests) data are reconstructed after compression without
errors, i.e. no information is lost. Typical application domains where you do not want to lose
information are the compression of text, files, and fax. In the case of image data, no information
loss can be tolerated for medical imaging or for the compression of maps in the context of land
registry. A further reason to stick to lossless coding schemes instead of lossy ones is their lower
computational demand.
Lossless Compression typically is a process with three stages:
The model: the data to be compressed is analyzed with respect to its structure and the
relative frequency of the occurring symbols.
The encoder: produces a compressed bitstream / file using the information provided by
the model.
The adaptor: uses information extracted from the data (usually during encoding) in order
to adapt the model (more or less) continuously to the data.
The most fundamental idea in lossless compression is to employ codewords which are shorter (in
terms of their binary representation) than their corresponding symbols when the symbols
occur frequently. Conversely, codewords may be longer than the corresponding symbols when
the latter occur rarely.
A lossy data compression method is one where compressing data and then decompressing it
retrieves data that may well be different from the original, but is "close enough" to be useful in
some way. Lossy data compression is used frequently on the Internet and especially in streaming
media and telephony applications. These methods are typically referred to as codecs in this
context. Most lossy data compression formats suffer from generation loss: repeatedly
compressing and decompressing the file will cause it to progressively lose quality. This is in
contrast with lossless data compression.
Lossless data compression algorithms will always fail to compress some files; indeed, any
compression algorithm will necessarily fail to compress data containing no discernible
patterns. Attempts to compress data that has already been compressed will therefore usually
result in an expansion, as will attempts to compress encrypted data.
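The effect can be observed with a short sketch. It uses Python's standard `zlib` module (a Deflate implementation) purely as a convenient lossless compressor; the sample inputs and the helper name `compressed_size` are illustrative assumptions, and random bytes stand in for data without discernible patterns.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after Deflate compression."""
    return len(zlib.compress(data, 9))

# Highly redundant input: the same sentence repeated many times.
text = b"the quick brown fox jumps over the lazy dog. " * 200
# Input with no discernible patterns: random bytes of the same length.
noise = os.urandom(len(text))

print(len(text), compressed_size(text))    # redundant text shrinks a lot
print(len(noise), compressed_size(noise))  # random bytes do not shrink
```

The random input comes out slightly larger than it went in, because the compressor still has to add its own framing bytes.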
Measure of Performance
Codeword: A binary string representing either the whole coded data or one coded data
symbol
Coded Bitstream: the binary string representing the whole coded data.
Lossless Compression: 100% accurate reconstruction of the original data
Lossy Compression: The reconstruction involves errors which may or may not be tolerable
Bit Rate: Average number of bits per original data element after compression
This is almost a saving of half the number of bits compared to 3 bits/symbol using a
3 bit fixed length code.
The shorter the codewords, the shorter the total length of a source file. Hence the
code would be a better one from the compression point of view.
Unique decodability
Variable length codes are useful for data compression. However, a variable length
code would be useless if the codewords could not be identified in a unique way from
the encoded message.
Example Consider the variable length code (0, 10, 010, 101) for alphabet (A, B, C,
D). A segment of encoded message such as '0100101010' can be decoded in more
than one way. For example, '0100101010' can be interpreted in at least two ways, '0
10 010 101 0' as ABCDA or '010 0 101 010' as CADC.
A code is uniquely decodable if there is only one possible way to decode encoded
messages. The code (0, 10, 010, 101) in Example above is not uniquely decodable
and therefore cannot be used for data compression.
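The ambiguity can be checked mechanically. The following is a minimal sketch (the function name `all_decodings` is an assumption, not part of the original text) that enumerates every way of splitting a bit string into codewords of a given code.

```python
def all_decodings(bits: str, code: dict[str, str]) -> list[str]:
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Consume one codeword, then decode the rest recursively.
            rest = all_decodings(bits[len(word):], code)
            results += [symbol + tail for tail in rest]
    return results

ambiguous = {"A": "0", "B": "10", "C": "010", "D": "101"}
# The list contains ABCDA and CADC, among other readings.
print(all_decodings("0100101010", ambiguous))
```

A code is uniquely decodable exactly when this list never contains more than one entry for any encoded message. The search is exponential in the worst case, so it is only meant for small examples.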
However, if a code is not a prefix code, we cannot conclude that the code is not
uniquely decodable. This is because other types of code may also be uniquely
decodable.
Example Consider code (0, 01, 011, 0111) for (A, B, C, D). This is not a prefix code
as the first codeword 0 is a prefix of the others.
However, given an encoded message 01011010111, there is no ambiguity and only
one way to decode it: 01 011 01 0111, i.e. BCBD. Each 0 offers a means of
self-punctuating in this example. We only need to watch out for the 0, which marks the
beginning of each codeword, and the bit 1 before any 0, which is the last bit of the
preceding codeword.
Some codes are uniquely decodable but require looking ahead during the decoding
process. This makes them not as efficient as prefix codes.
The coding process generates a binary tree, the Huffman code tree, with
branches labeled with bits (0 and 1).
The Huffman tree (or the character-codeword pairs) must be sent with the
compressed information to enable the receiver to decode the message.
For each distinct character create a one-node binary tree containing the character
and its frequency as its priority;
Insert the one-node binary trees in a priority queue in increasing order of frequency;
while (there are more than one tree in the priority queue) {
dequeue two trees t1 and t2;
Create a tree t that contains t1 as its left subtree and t2 as its right subtree; // 1
priority (t) = priority(t1) + priority(t2);
insert t in its proper location in the priority queue; // 2
}
Assign 0 and 1 weights to the edges of the resulting tree, such that the left and
right edge of each node do not have the same weight; // 3
Note: The Huffman code tree for a particular set of characters is not
unique.
(Steps 1, 2, and 3 may be done differently).
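The pseudocode above can be sketched in Python with the standard `heapq` module as the priority queue. This is one possible realisation, not the only one: the function name `huffman_code`, the tie-breaking counter, and the sample frequencies are assumptions made for illustration, and a different tie-breaking rule would give different (equally good) codewords.

```python
import heapq
from itertools import count

def huffman_code(freq: dict[str, int]) -> dict[str, str]:
    """Build a Huffman code (symbol -> bit string) from symbol frequencies."""
    tiebreak = count()  # keeps heap entries comparable when priorities tie
    # Heap entries are (priority, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)   # two trees of lowest priority
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))

    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # left edge labelled 0
            walk(tree[1], prefix + "1")   # right edge labelled 1
        else:
            code[tree] = prefix or "0"    # single-symbol alphabet gets "0"
    walk(heap[0][2], "")
    return code

print(huffman_code({"a": 5, "b": 2, "c": 1, "d": 1}))
```

Whatever the tie-breaking, the total encoded length (the sum of frequency times codeword length) is the same for every Huffman tree of a given input.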
Example: Information to be transmitted over the internet contains the following
characters with their associated frequencies:
Character   a    e    l    n    o    s    t
Frequency   45   65   13   45   18   22   53
Now assign codes to edges in the tree. Put left edge as 0 and right edge as 1.
The sequence of zeros and ones that are the arcs in the path from the root to each
leaf node are the desired codes:
Character          a     e    l      n     o      s     t
Huffman codeword   110   10   0110   111   0111   010   00
If we assume the message consists of only the characters a,e,l,n,o,s,t then the
number of bits for the compressed message will be 696:
If the message is sent uncompressed with 8-bit ASCII representation for the
characters, we have 261*8 = 2088 bits.
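The two bit counts can be reproduced with a few lines of Python. The sketch assumes the frequencies a=45, e=65, l=13, n=45, o=18, s=22, t=53 and the codewords a=110, e=10, l=0110, n=111, o=0111, s=010, t=00 for this example; the helper name `encoded_bits` is hypothetical.

```python
# Assumed frequencies and Huffman codewords of the 261-character example.
freq = {"a": 45, "e": 65, "l": 13, "n": 45, "o": 18, "s": 22, "t": 53}
code = {"a": "110", "e": "10", "l": "0110", "n": "111",
        "o": "0111", "s": "010", "t": "00"}

def encoded_bits(freq: dict[str, int], code: dict[str, str]) -> int:
    """Total message length in bits: sum of frequency * codeword length."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

compressed_bits = encoded_bits(freq, code)   # Huffman-coded message
ascii_bits = 8 * sum(freq.values())          # 8-bit ASCII message
print(compressed_bits, ascii_bits, compressed_bits / ascii_bits)
```

Under these assumptions the Huffman-coded message needs one third of the bits of the 8-bit representation, before counting the space needed to transmit the code table itself.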
Assuming that the number of character-codeword pairs and the pairs are included
at the beginning of the binary file containing the compressed message in the
following format:
[Figure residue: a codeword table (101, 100, 111, 110, 1100) and two decoding examples
(=> face, => eaace)]
A code is optimal if the average length of the codewords equals the entropy of the
source.
Let the source have probabilities $p_1, \dots, p_n$ and let $l_j$ be the length of the
codeword for the j-th symbol. The average length is $\bar{l} = \sum_{j=1}^{n} p_j l_j$ and
the entropy is $H = -\sum_{j=1}^{n} p_j \log_2 p_j$.
And notice that $\bar{l} = H$ means
$$\sum_{j=1}^{n} p_j l_j = \sum_{j=1}^{n} p_j \, (-\log_2 p_j).$$
This equation holds if and only if $l_j = -\log_2 p_j$ for all j = 1, 2, ..., n. Since the
length $l_j$ has to be an integer (in bits) for Huffman codes, $-\log_2 p_j$ has to be an
integer, too. Of course, $-\log_2 p_j$ cannot be an integer unless $p_j$ is a negative
power of 2, for all j = 1, 2, ..., n.
In other words, for Huffman codes this can only happen if all probabilities are negative
powers of 2.
For example, for a source P = (1/2, 1/4, 1/8, 1/8), Huffman codes for the source can
be optimal.
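The claim for the source (1/2, 1/4, 1/8, 1/8) can be checked numerically. The sketch below assumes codeword lengths 1, 2, 3, 3 (for instance the codewords 0, 10, 110, 111, which are one valid Huffman code for this source); the function names are illustrative.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

def average_length(probs, lengths):
    """Expected codeword length in bits per symbol."""
    return sum(p * l for p, l in zip(probs, lengths))

probs = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]        # assumed Huffman codeword lengths

print(entropy(probs), average_length(probs, lengths))
```

Both values come out as 1.75 bits per symbol, because every probability is a negative power of 2 and each length equals the corresponding $-\log_2 p_j$.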
Optimality of Huffman coding
We show that the prefix code generated by the Huffman coding algorithm is optimal
in the sense of minimizing the expected code length among all binary prefix codes
for the input alphabet.
Claim: The Huffman coding algorithm returns a binary prefix code of lowest expected
code length among all prefix codes for the input alphabet.
(Basis) If n = 2, the Huffman algorithm finds the optimal prefix code, which
assigns 0 to one symbol of the alphabet and 1 to the other.
(Induction Hypothesis) For some n ≥ 2, Huffman coding returns an optimal prefix
code for any input alphabet containing n symbols.
(Inductive Step) Assume that the Induction Hypothesis (IH) holds for some value
of n. Let A be an input alphabet containing n + 1 symbols. We show that Huffman
coding returns an optimal prefix code for A.
Let a1 and a2 be the two symbols of smallest probability in A. Consider the merged
alphabet A' = merge(A, a1, a2) as defined above. By the IH, the Huffman tree T' for
this merged alphabet A' is optimal. We also know that T' is in fact the same as the
tree merge(T, a1, a2) that results from the Huffman tree T for the original alphabet
A by replacing a1 and a2 with their parent node. Furthermore, the expected code
length L for T exceeds the expected code length L' for T' by exactly the sum p of
the probabilities of a1 and a2.
We claim that no binary coding tree for A has an expected code length less than
L = L' + p.
Let T2 be any tree of lowest expected code length for A. Without loss of generality,
a1 and a2 will be leaves at the deepest level of T2, since otherwise one could swap
them with leaves at the deepest level without increasing the expected code length.
Furthermore, we may assume that a1 and a2 are siblings in T2. Therefore, we obtain
from T2 a coding tree T2' for the merged alphabet A' through the merge procedure
described above, replacing a1 and a2 by their parent labeled with the sum of their
probabilities:
T2' = merge(T2, a1, a2). By the observation above, the expected code lengths L2
and L2' of T2 and T2' respectively satisfy L2 = L2' + p. But by the IH, T' is optimal
for the alphabet A'. Therefore,
L2' ≥ L'. It follows that L2 = L2' + p ≥ L' + p = L,
which shows that T is optimal as claimed.
For example, consider the following message probabilities and codes (columns: symbol,
probability, code 1, code 2).
Both codings produce an average of 2.2 bits per symbol, even though the lengths are quite
different in the two codes. Given this choice, is there any reason to pick one code over the other?
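For some applications it can be helpful to reduce the variance in the code length. The variance
is defined as
$$\sum_{c \in C} p(c) \, \bigl(l(c) - l_a(C)\bigr)^2,$$
where $p(c)$ is the probability of symbol $c$, $l(c)$ is the length of its codeword, and
$l_a(C)$ is the average codeword length of the code $C$.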
With lower variance it can be easier to maintain a constant character transmission rate, or reduce
the size of buffers. In the above example, code 1 clearly has a much higher variance than code 2.
It turns out that a simple modification to the Huffman algorithm can be used to generate a code
that has minimum variance. In particular, when choosing the two nodes to merge and there is a
choice based on weight, always pick the node that was created earliest in the algorithm. Leaf
nodes are assumed to be created before all internal nodes. In the example above, after d and e
are joined, the pair will have the same probability as c and a (.2), but it was created afterwards,
so we join c and a. Similarly we select b instead of ac to join with de since it was created earlier.
This will give code 2 above, and the corresponding Huffman tree in the figure below.
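To make the comparison concrete, here is a small sketch that computes the mean and variance of the codeword length. The probabilities (a=0.2, b=0.4, c=0.2, d=0.1, e=0.1) and the two sets of codeword lengths are assumptions chosen to be consistent with the merge steps described above and with the stated average of 2.2 bits per symbol; they are not taken from a table in this text.

```python
def mean_and_variance(probs, lengths):
    """Return (average codeword length, variance of the codeword length)."""
    mean = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - mean) ** 2 for p, l in zip(probs, lengths))
    return mean, var

probs = [0.2, 0.4, 0.2, 0.1, 0.1]   # assumed probabilities of a, b, c, d, e
code1_lengths = [2, 1, 3, 4, 4]     # assumed lengths for the unbalanced code
code2_lengths = [2, 2, 2, 3, 3]     # lengths after merging ca, then b with de

print(mean_and_variance(probs, code1_lengths))
print(mean_and_variance(probs, code2_lengths))
```

Under these assumptions both codes average 2.2 bits per symbol, but the variance drops from about 1.36 for the first code to about 0.16 for the second.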
This gives a gap of 1 - 0.72 = 0.28 bit. The performance of the Huffman encoding
algorithm is, therefore, 0.28/1 = 28% worse than optimal in this case.
The idea of extended Huffman coding is to encode a sequence of source symbols
instead of individual symbols. The alphabet size of the source is artificially
increased in order to improve the code efficiency. For example, instead of
assigning a codeword to every individual symbol for a source alphabet, we derive a
codeword for every two symbols.
The following example shows how to achieve this:
Example 4.8 Create a new alphabet S' = (aa, ab, ba, bb) extended from S = (a, b).
Let aa = A, ab = B, ba = C and bb = D. We now have an extended alphabet S' = (A,
B, C, D). Each symbol in the alphabet S' is a combination of two symbols from the
original alphabet S. The size of the alphabet S' increases to 2^2 = 4.
Suppose symbols 'a' and 'b' occur independently (with Pa = 0.8 and Pb = 0.2, as the
values below imply). The probability distribution for S', the extended alphabet, can be
calculated as below:
PA = Pa × Pa = 0.64
PB = Pa × Pb = 0.16
PC = Pb × Pa = 0.16
PD = Pb × Pb = 0.04
We then follow the normal static Huffman encoding algorithm to derive the
Huffman code for S'.
The canonical minimum-variance code for S' is (0, 11, 100, 101), for A, B, C, D
respectively. The average length is 1.56 bits for two symbols.
The output is therefore 1.56/2 = 0.78 bit per original symbol. The efficiency of the
code has been increased to 0.72/0.78 ≈ 92%. This is only (0.78 - 0.72)/0.78 ≈ 8%
worse than optimal.
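The figures of this example can be recomputed with a short sketch. It assumes Pa = 0.8 and Pb = 0.2 (the values implied by PA = 0.64 and PB = 0.16) and the code (0, 11, 100, 101) for the pairs aa, ab, ba, bb; the function name `extended_stats` is hypothetical.

```python
from itertools import product
from math import log2

def extended_stats(p, code):
    """Return (bits per pair, bits per original symbol, entropy, efficiency)."""
    # Probability of each pair, assuming independent symbols.
    pairs = {x + y: p[x] * p[y] for x, y in product(p, repeat=2)}
    per_pair = sum(pairs[s] * len(code[s]) for s in pairs)
    per_symbol = per_pair / 2
    h = -sum(q * log2(q) for q in p.values())   # entropy of the original source
    return per_pair, per_symbol, h, h / per_symbol

p = {"a": 0.8, "b": 0.2}                                   # assumed probabilities
code = {"aa": "0", "ab": "11", "ba": "100", "bb": "101"}   # code for S'

print(extended_stats(p, code))
```

The result is 1.56 bits per pair, 0.78 bit per original symbol, an entropy of roughly 0.72 bit, and an efficiency between 92% and 93%.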
Dynamic/Adaptive Huffman Coding
For Huffman coding, we need to know the probabilities for the individual symbols
(and we also need to know the joint probabilities for blocks of symbols in extended
Huffman coding). If the probability distributions are not known for a file of
characters to be transmitted, they have to be estimated first and the code itself has
to be included in the transmission of the coded file. In dynamic Huffman coding a
particular probability (frequency of symbols) distribution is assumed at the
transmitter and receiver and hence a Huffman code is available to start with. As
source symbols come in to be coded the relative frequency of the different symbols
is updated at both the transmitter and the receiver, and corresponding to this the
code itself is updated. In this manner the code continuously adapts to the nature of
the source distribution, which may change as time progresses and different files are
being transferred.
RICE CODES
Rice encoding (a special case of Golomb coding) can be applied to reduce the bits
required to represent the lower value numbers. Rice's algorithm is also easy to
implement.
Named after Robert Rice, Rice coding is a specialised form of Golomb coding. It's
used to encode strings of numbers with a variable bit length for each number. If
most of the numbers are small, fairly good compression can be achieved. Rice
coding is generally used as the entropy coding stage in audio/video codecs.
Rice coding depends on a parameter k and works the same as Golomb coding with a parameter m
where m = 2^k. The encoding of a number proceeds as described below.
Algorithm Overview
Given a constant M, any symbol S can be represented as a quotient (Q) and remainder (R), where:
S = Q × M + R.
If S is small (relative to M) then Q will also be small. Rice encoding is designed to reduce the
number of bits required to represent symbols where Q is small.
Rather than representing both Q and R as binary values, Rice encoding represents Q as a unary
value and R as a binary value.
For those not familiar with unary notation, a value N may be represented by N 1s followed by a 0.
Example: 3 = 1110 and 5 = 111110.
Note: The following is true for binary values, if log2(M) = K where K is an integer:
1. Q = S >> K (S right shifted K bits)
2. R = S & (M - 1) (S bitwise ANDed with (M - 1))
3. R can be represented using K bits.
Encoding
Rice coding is fairly straightforward.
Given a bit length K, compute the modulus M using the equation M = 2^K. Then do the following
for each symbol (S):
1. Write out S >> K in unary.
2. Write out S & (M - 1) in binary, using K bits.
Example:
Encode the 8-bit value 18 (0b00010010) when K = 4 (M = 16)
1. S >> K = 18 >> 4 = 0b00010010 >> 4 = 0b0001 (10 in unary)
2. S & (M - 1) = 18 & (16 - 1) = 0b00010010 & 0b1111 = 0b0010
The encoded value is therefore 100010.
Decoding
Decoding isn't any harder than encoding.
As with encoding, given a bit length K, compute the modulus M using the equation M = 2^K.
Then do the following for each encoded symbol (S):
1. Determine Q by counting the number of 1s before the first 0.
2. Determine R by reading the next K bits as a binary value.
3. Write out S as Q × M + R.
Example:
Decode the encoded value 100010 when K = 4 (M = 16)
1. Q = 1
2. R = 0b0010 = 2
3. S = Q × M + R = 1 × 16 + 2 = 18
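The encoding and decoding steps can be sketched as follows. Bit strings are represented as Python strings of '0' and '1' for readability, and the unary quotient is written before the K-bit remainder, which is the order the decoder expects; the function names are illustrative assumptions.

```python
def rice_encode(s: int, k: int) -> str:
    """Rice-code the non-negative integer `s` with parameter `k` (M = 2**k)."""
    q = s >> k                  # quotient: s divided by M
    r = s & ((1 << k) - 1)      # remainder: the low k bits of s
    remainder_bits = format(r, f"0{k}b") if k > 0 else ""
    return "1" * q + "0" + remainder_bits   # unary quotient, then remainder

def rice_decode(bits: str, k: int) -> int:
    """Decode a single Rice codeword produced by rice_encode."""
    q = bits.index("0")         # number of 1s before the first 0
    r = int(bits[q + 1:q + 1 + k], 2) if k > 0 else 0
    return (q << k) + r         # S = Q * M + R

print(rice_encode(18, 4))        # the worked example: K = 4, S = 18
print(rice_decode("100010", 4))
```

With K = 4 the value 18 needs six bits instead of eight, whereas a value with a large quotient would need more bits than its plain binary form.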
Rice coding only works well when symbols are encoded with small values of Q.
Since Q is unary, encoded symbols can become quite large for even slightly large
Qs. It takes 8 bits just to represent a quotient of 7. One way to improve the
compression obtained by Rice coding on generic data is to apply a reversible
transformation to the data that reduces the average value of a symbol. The
Burrows-Wheeler Transform (BWT) with Move-To-Front (MTF) encoding is such a
transform.
Development:
Jacob Ziv and Abraham Lempel introduced a simple and efficient compression method
in their article "A Universal Algorithm for Sequential Data Compression". This
algorithm is referred to as LZ77 in honour of the authors and the publishing date 1977.
Fundamentals:
LZ77 is a dictionary based algorithm that addresses byte sequences from former contents instead
of the original data. In general only one coding scheme exists, all data will be coded in the same
form:
Address of already coded contents
Sequence length
First deviating symbol
If no identical byte sequence is available from former contents, the address 0, the sequence
length 0 and the new symbol will be coded.
Example "abracadabra":
Coded      Remaining      Addr.   Length   Deviating symbol
           abracadabra    0       0        'a'
a          bracadabra     0       0        'b'
ab         racadabra      0       0        'r'
abr        acadabra       3       1        'c'
abrac      adabra         2       1        'd'
abracad    abra           7       4        ''
Because each byte sequence is extended by the first symbol deviating from the former contents,
the set of already used symbols will continuously grow. No additional coding scheme is
necessary. This allows an easy implementation with minimum requirements to the encoder and
decoder.
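A minimal sketch of this scheme is given below. It emits (address, length, deviating symbol) triples, where the address is the distance back from the current position; the brute-force search, the preference for the nearest match among equally long ones, the window size, and the function names are all assumptions made for illustration.

```python
def lz77_encode(data: str, window: int = 255):
    """Encode `data` as (address, length, deviating symbol) triples."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the already coded contents, nearest candidate first.
        for j in range(i - 1, max(0, i - window) - 1, -1):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decode(triples) -> str:
    """Rebuild the text from (address, length, deviating symbol) triples."""
    out = []
    for off, length, sym in triples:
        for _ in range(length):
            out.append(out[-off])   # copy one symbol from `off` positions back
        if sym:
            out.append(sym)
    return "".join(out)

print(lz77_encode("abracadabra"))
```

For "abracadabra" the encoder produces the six triples (0,0,'a'), (0,0,'b'), (0,0,'r'), (3,1,'c'), (2,1,'d'), (7,4,''), and the decoder needs nothing but its own previous output to reverse them.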
Restrictions:
To keep runtime and buffering capacity in an acceptable range, the addressing must be limited to
a certain maximum. Contents exceeding this range will not be regarded for coding and will not
be covered by the size of the addressing pointer.
Compression Efficiency:
The achievable compression rate depends only on repeating sequences. Other types of
redundancy, like an unequal probability distribution of the set of symbols, cannot be reduced. For
that reason the compression of a pure LZ77 implementation is relatively low.
A significantly better compression rate can be obtained by combining LZ77 with an additional
entropy coding algorithm. An example would be Huffman or Shannon-Fano coding. The
widespread Deflate compression method (e.g. for GZIP or ZIP) uses Huffman codes, for instance.
Of these, LZ77 is probably the most straightforward. It tries to replace recurring patterns in the
data with a short code. The code tells the decompressor how many symbols to copy and from
where in the output to copy them. To compress the data, LZ77 maintains a history buffer which
contains the data that has been processed and tries to match the next part of the message to it. If
there is no match, the next symbol is output as-is. Otherwise an (offset, length) pair is output.
History            Lookahead          Output
                   SIDVICIIISIDIDVI   S
S                  IDVICIIISIDIDVI    I
SI                 DVICIIISIDIDVI     D
SID                VICIIISIDIDVI      V
SIDV               ICIIISIDIDVI       I
SIDVI              CIIISIDIDVI        C
SIDVIC             IIISIDIDVI         I
SIDVICI            IISIDIDVI          I
SIDVICII           ISIDIDVI           I
SIDVICIII          SIDIDVI            (9, 3)    match length: 3, match offset: 9
SIDVICIIISID       IDVI               (2, 2)    match length: 2, match offset: 2
SIDVICIIISIDID     VI                 (11, 2)   match length: 2, match offset: 11
SIDVICIIISIDIDVI
At each stage the string in the lookahead buffer is searched for in the history buffer. The longest
match is used and the distance between the match and the current position is output, with the
match length. The processed data is then moved to the history buffer. Note that the history buffer
contains data that has already been output. On the decompression side it corresponds to the data
that has already been decompressed. The message becomes:
S I D V I C I I I (9,3) (2,2) (11,2)
The following describes what the decompressor does with this data.
Input     Copied    History after the step
S                   S
I                   SI
D                   SID
V                   SIDV
I                   SIDVI
C                   SIDVIC
I                   SIDVICI
I                   SIDVICII
I                   SIDVICIII
(9,3)     -> SID    SIDVICIIISID
(2,2)     -> ID     SIDVICIIISIDID
(11,2)    -> VI     SIDVICIIISIDIDVI
In the decompressor the history buffer contains the data that has already been decompressed. If
we get a literal symbol code, it is added as-is. If we get an (offset,length) pair, the offset tells us
from where to copy and the length tells us how many symbols to copy to the current output
position. For example (9,3) tells us to go back 9 locations and copy 3 symbols to the current
output position. The great thing is that we don't need to transfer or maintain any other data
structure than the data itself.
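The decompressor's behaviour can be sketched in a few lines. Here literals are one-character strings and matches are (offset, length) tuples, a token representation assumed purely for illustration (a real format would pack these into bits); the function name `lz77_expand` is hypothetical.

```python
def lz77_expand(tokens) -> str:
    """Expand a stream of literals and (offset, length) pairs."""
    history = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                # Copy symbol by symbol so a match may overlap its own output.
                history.append(history[-offset])
        else:
            history.append(token)   # literal symbol, added as-is
    return "".join(history)

message = ["S", "I", "D", "V", "I", "C", "I", "I", "I", (9, 3), (2, 2), (11, 2)]
print(lz77_expand(message))
```

Run on the message above, the sketch rebuilds the original 16-symbol string using only the data it has already produced.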
Lempel-Ziv 1977
In 1977 Ziv and Lempel proposed a lossless compression method which replaces
phrases in the data stream by a reference to a previous occurrence of the phrase.
As long as it takes fewer bits to represent the reference and the phrase length than
the phrase itself, we get compression. This is similar to the way BASIC substitutes tokens
for keywords.
LZ77-type compressors use a history buffer, which contains a fixed amount of symbols
output/seen so far. The compressor reads symbols from the input to a lookahead buffer and tries
to find the longest possible match in the history buffer. The length of the string match and the
location in the buffer (offset from the current position) are written to the output. If there is no
suitable match, the next input symbol is sent as a literal symbol.
Of course there must be a way to identify literal bytes and compressed data in the output. There
are a lot of different ways to accomplish this, but a single bit to select between a literal and
compressed data is the easiest.
The basic scheme is a variable-to-block code. A variable-length piece of the message is
represented by a constant amount of bits: the match length and the match offset. Because the data
in the history buffer is known to both the compressor and decompressor, it can be used in the
compression. The decompressor simply copies part of the already decompressed data or a literal
byte to the current output position.
Variants of LZ77 apply additional compression to the output of the compressor, which include a
simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding
(ZIP 1.x), all of which result in a certain degree of improvement over the basic scheme. This is
because the output values from the first stage are not evenly distributed, i.e. their probabilities
are not equal and statistical compression can do its part.
Lempel-Ziv-78 (LZ78)
One year after publishing LZ77, Jacob Ziv and Abraham Lempel introduced another
compression method ("Compression of Individual Sequences via Variable-Rate Coding").
Accordingly this procedure is called LZ78.
Fundamental algorithm:
LZ78 is based on a dictionary that will be created dynamically at runtime. Both the encoding and
the decoding process use the same rules to ensure that an identical dictionary is available. This
dictionary contains any sequence already used to build the former contents. The compressed data
have the general form:
Index of the dictionary entry
First deviating symbol
In contrast to LZ77 no combination of address and sequence length is used. Instead only the
index to the dictionary is stored. The mechanism to add the first deviating symbol remains from
LZ77.
Example "abracadabra":
Coded        Remaining      Index   Deviating symbol   New dictionary entry
             abracadabra    0       'a'                1 "a"
a            bracadabra     0       'b'                2 "b"
ab           racadabra      0       'r'                3 "r"
abr          acadabra       1       'c'                4 "ac"
abrac        adabra         1       'd'                5 "ad"
abracad      abra           1       'b'                6 "ab"
abracadab    ra             3       'a'                7 "ra"
An LZ78 dictionary grows slowly. For relevant compression a larger amount of data must
be processed. Additionally the compression depends mainly on the size of the dictionary.
But a larger dictionary requires more effort for addressing and administration at runtime.
In practice the dictionary would be implemented as a tree to minimize the effort for searching.
Starting with the current symbol, the algorithm evaluates for every succeeding symbol whether it
is available in the tree. If a leaf node is found, the corresponding index will be written to the
compressed data. The decoder could be realized with a simple table, because the decoder does
not need the search function.
The size of the dictionary grows during the coding process, so that the size for addressing
the table would increase continuously. In parallel the requirements for storing and searching
would also grow permanently. A limit on the dictionary size and corresponding update
mechanisms are required.
LZ78 is the basis for other compression methods like the widespread LZW, used e.g. for GIF
graphics.
Lempel-Ziv 1978
One large problem with the LZ77 method is that it does not use the coding space
efficiently, i.e. there are length and offset values that never get used. If the history
buffer contains multiple copies of a string, only the latest occurrence is needed, but
they all take space in the offset value space. Each duplicate string wastes one offset
value.
To get higher efficiency, we have to create a real dictionary. Strings are added to the codebook
only once. There are no duplicates that waste bits just because they exist. Also, each entry in the
codebook will have a specific length, thus only an index to the codebook is needed to specify a
string (phrase). In LZ77 the length and offset values were handled more or less as disconnected
variables although there is correlation. Because they are now handled as one entity, we can
expect to do a little better in that regard also.
LZ78-type compressors use this kind of a dictionary. The next part of the message (the
lookahead buffer contents) is searched from the dictionary and the maximum-length match is
returned. The output code is an index to the dictionary. If there is no suitable entry in the
dictionary, the next input symbol is sent as a literal symbol. The dictionary is updated after each
symbol is encoded, so that it is possible to build an identical dictionary in the decompression
code without sending additional data.
Essentially, strings that we have seen in the data are added to the dictionary. To be able to
constantly adapt to the message statistics, the dictionary must be trimmed down by discarding
the oldest entries. This also prevents the dictionary from becoming full, which would decrease
the compression ratio. This is handled automatically in LZ77 by its use of a history buffer (a
sliding window). For LZ78 it must be implemented separately. Because the decompression code
updates its dictionary in synchronization with the compressor, the code remains uniquely
decodable.
LZ78 example string: "bed spreaders spread spreads on beds"
Encoding
At the beginning of encoding the dictionary is empty. In order to explain the principle of
encoding, let's consider a point within the encoding process, when the dictionary already
contains some strings.
We start analyzing a new prefix in the charstream, beginning with an empty prefix. If its
corresponding string (prefix + the character after it -- P+C) is present in the dictionary, the prefix
is extended with the character C. This extending is repeated until we get a string which is not
present in the dictionary. At that point we output two things to the codestream: the code word
that represents the prefix P, and then the character C. Then we add the whole string (P+C) to the
dictionary and start processing the next prefix in the charstream.
A special case occurs if the dictionary doesn't contain even the starting one-character string (for
example, this always happens in the first encoding step). In that case we output a special code
word that represents an empty string, followed by this character and add this character to the
dictionary.
The output from this algorithm is a sequence of code word-character pairs (W,C). Each time a
pair is output to the codestream, the string from the dictionary corresponding to W is extended
with the character C and the resulting string is added to the dictionary. This means that when a
new string is being added, the dictionary already contains all the substrings formed by removing
characters from the end of the new string.
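The encoding algorithm
1. At the start, the dictionary and P are empty;
2. C := next character in the charstream;
3. Is the string P+C present in the dictionary?
a. if it is, P := P+C (extend P with C);
b. if not,
i. output the code word corresponding to P (the empty-string code word if P is
empty), followed by C;
ii. add the string P+C to the dictionary;
iii. P := empty;
c. are there more characters in the charstream?
if yes, return to step 2;
if not:
i. if P is not empty, output the code word corresponding to P;
ii. END.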
Decoding
At the start of decoding the dictionary is empty. It gets reconstructed in the process of
decoding. In each step a pair code word-character -- (W,C) is read from the codestream.
The code word always refers to a string already present in the dictionary. The string.W
and C are output to the charstream and the string (string.W+C) is added to the dictionary.
After the decoding, the dictionary will look exactly the same as after the encoding.
The decoding algorithm
1. At the start the dictionary is empty;
2. W := next code word in the codestream;
3. C := the character following it;
4. output the string.W to the charstream (this can be an empty string), and then
output C;
5. add the string.W+C to the dictionary;
6. are there more code words in the codestream?
if yes, go back to step 2;
if not, END.
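The encoding and decoding procedures described above can be sketched as follows. Code word 0 stands for the empty string and dictionary indices start at 1; a leftover prefix at the end of the input is written as a pair with an empty character, which is one possible convention assumed here. The function names are illustrative.

```python
def lz78_encode(text: str):
    """Encode `text` as a list of (code word, character) pairs."""
    dictionary = {}   # string -> index, starting at 1
    out, p = [], ""
    for c in text:
        if p + c in dictionary:
            p += c                                  # extend the prefix
        else:
            out.append((dictionary.get(p, 0), c))   # 0 = empty prefix
            dictionary[p + c] = len(dictionary) + 1
            p = ""
    if p:                                           # input ended inside a known string
        out.append((dictionary[p], ""))
    return out

def lz78_decode(pairs) -> str:
    """Rebuild the text; the dictionary is reconstructed on the fly."""
    entries = [""]                                  # index 0 is the empty string
    out = []
    for w, c in pairs:
        s = entries[w] + c
        out.append(s)
        entries.append(s)
    return "".join(out)

print(lz78_encode("ABBCBCABA"))
print(lz78_decode(lz78_encode("abracadabra")))
```

The decoder uses a plain list because it only ever looks entries up by index, while the encoder needs a lookup by string, which mirrors the remark that only the encoder needs a search structure.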
An example
The encoding process is presented in Table 1.
o The column Step indicates the number of the encoding step. Each encoding step
is completed when the step 3.b. in the encoding algorithm is executed.
o The column Pos indicates the current position in the input data.
o The column Dictionary shows what string has been added to the dictionary. The
index of the string is equal to the step number.
o The column Output presents the output in the form (W,C).
o The output of each step decodes to the string that has been added to the dictionary.
Charstream to be encoded:
Pos    1   2   3   4   5   6   7   8   9
Char   A   B   B   C   B   C   A   B   A
Table 1: The encoding process
Step   Pos   Dictionary   Output
1.     1     A            (0,A)
2.     2     B            (0,B)
3.     3     BC           (2,C)
4.     5     BCA          (3,A)
5.     8     BA           (2,A)
Lempel-Ziv-Welch (LZW)
The LZW compression method is derived from LZ78, which was introduced by Jacob Ziv and Abraham
Lempel. It was developed by Terry A. Welch, who published it in 1984 in the
article "A Technique for High-Performance Data Compression".
At that time Terry A. Welch was employed in a leading position at the Sperry Research Center.
The LZW method was covered by patents valid in a number of countries, e.g. in the USA, Europe and
Japan. Unisys came to hold the rights, but there were probably further LZW-related patents held
by other companies. Some of these patents expired in 2003 (USA) and 2004 (Europe,
Japan).
LZW is an important part of a variety of data formats. Graphic formats like GIF, TIFF (optional) and
PostScript (optional) use LZW to compress their data.
Fundamental algorithm:
LZW builds a dictionary that contains every byte sequence already coded. The compressed
data consist exclusively of indices into this dictionary. Before coding starts, the dictionary is preset
with entries for the 256 single-byte symbols. Every subsequent entry represents a sequence longer
than one byte.
The algorithm presented by Terry Welch defines mechanisms to create the dictionary and to
ensure that it will be identical for both the encoding and decoding process.
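A simplified sketch of this mechanism in Python is given below. It is not Welch's exact formulation or any file format's variant: the function names are hypothetical, the dictionary is allowed to grow without limit, and codes are kept as plain integers instead of being packed into variable-width bit fields.

```python
def lzw_encode(data):
    """Return a list of dictionary indices for the bytes in data."""
    table = {bytes([i]): i for i in range(256)}  # preset single-byte entries
    w = b""
    codes = []
    for value in data:
        wc = w + bytes([value])
        if wc in table:
            w = wc                      # keep extending the current sequence
        else:
            codes.append(table[w])      # emit the index of the longest match
            table[wc] = len(table)      # new entry: match plus next byte
            w = bytes([value])
    if w:
        codes.append(table[w])
    return codes


def lzw_decode(codes):
    """Rebuild the same dictionary from the indices alone."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = w + w[:1]           # index of the entry still being built
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)


codes = lzw_encode(b"ABABABA")
print(codes, lzw_decode(codes))
```

The decoder never receives the dictionary itself. It derives each new entry one step behind the encoder, which is why the special case for an index that is not yet in the table is needed.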
Arithmetic Coding
Arithmetic coding is one of the most efficient methods to code symbols according to the probability of
their occurrence. The average code length comes very close to the possible minimum given by
information theory. The deviations caused by the whole-bit resolution of binary code trees do
not occur.
Compared with a binary Huffman code tree, arithmetic coding generally offers a better
compression rate. On the other hand, its implementation is more complex.
Unfortunately its usage is restricted by patents. As far as is known, arithmetic coding may not be
used without acquiring licences.
Arithmetic coding is part of the JPEG data format. As an alternative to Huffman coding, it can be used
for the final entropy coding. Despite its lower efficiency, Huffman coding remains the standard due
to the legal restrictions mentioned above.
Arithmetic Coding
In arithmetic coding, a message is encoded as a real number in the interval from zero to one.
Arithmetic coding typically has a better compression ratio than Huffman coding, as it produces a
single codeword for the whole message rather than several separate codewords. Arithmetic coding is
a lossless coding technique. There are a few disadvantages of arithmetic coding. One is that the
whole codeword must be received to start decoding the symbols, and if there is a corrupt bit in the
codeword, the entire message could become corrupt. Another is that there is a limit to the
precision of the number which can be encoded, thus limiting the number of symbols that can be
encoded within a codeword. There are also many patents on arithmetic coding, so the use of some
of the algorithms may incur royalty fees.
Here is the arithmetic coding algorithm, with an example to aid understanding.
1. Start with an interval [0, 1), divided into subintervals for all possible symbols that can appear
within a message. Make the size of each subinterval proportional to the frequency at
which the symbol appears in the message. E.g.:
Symbol  Probability  Interval
a       0.2          [0.0, 0.2)
b       0.3          [0.2, 0.5)
c       0.1          [0.5, 0.6)
d       0.4          [0.6, 1.0)
2. When encoding a symbol, "zoom" into the current interval, and divide it into subintervals
as in step one, using the new range. Example: suppose we want to encode "abd". We
"zoom" into the interval corresponding to "a", [0.0, 0.2), and divide that interval into smaller
subintervals as before. We now use this new interval as the basis of the next symbol
encoding step.
Symbol  New "a" interval
a       [0.0, 0.04)
b       [0.04, 0.1)
c       [0.1, 0.12)
d       [0.12, 0.2)
3. Repeat the process until the maximum precision of the machine is reached, or all symbols
are encoded. To encode the next character "b", we use the "a" interval created before,
zoom into its subinterval "b", [0.04, 0.1), and use that for the next step. This produces:
Symbol  New "b" interval
a       [0.04, 0.052)
b       [0.052, 0.07)
c       [0.07, 0.076)
d       [0.076, 0.1)
Encoding the final character "d" then selects the subinterval [0.076, 0.1), which divides into:
Symbol  New "d" interval
a       [0.076, 0.0808)
b       [0.0808, 0.088)
c       [0.088, 0.0904)
d       [0.0904, 0.1)
4. Transmit some number within the latest interval to send the codeword. The number of
symbols encoded will be stated in the protocol of the image format, so any number within
[0.076, 0.1) will be acceptable.
To decode the message, a similar algorithm is followed, except that the final number is given,
and the symbols are decoded sequentially from that.
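The "zooming" procedure of steps 1 to 4 can be traced with a small Python sketch. The names INTERVALS and narrow are hypothetical, the symbol names a to d are taken from the "abd" example, and exact fractions are used instead of machine floats so that the precision limit mentioned in step 3 does not affect the illustration.

```python
from fractions import Fraction

# Subintervals of [0, 1) from step 1: symbol -> (low, high)
INTERVALS = {
    "a": (Fraction(0), Fraction(2, 10)),
    "b": (Fraction(2, 10), Fraction(5, 10)),
    "c": (Fraction(5, 10), Fraction(6, 10)),
    "d": (Fraction(6, 10), Fraction(1)),
}


def narrow(message):
    """Return the interval reached after each symbol of message."""
    low, high = Fraction(0), Fraction(1)
    steps = []
    for symbol in message:
        width = high - low
        sym_low, sym_high = INTERVALS[symbol]
        # Zoom into the part of the current interval that belongs to the symbol.
        low, high = low + width * sym_low, low + width * sym_high
        steps.append((low, high))
    return steps


for low, high in narrow("abd"):
    print(f"[{float(low)}, {float(high)})")
```

Any number inside the last printed interval identifies the message, provided the decoder also knows how many symbols to read.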
Algorithm Overview
Arithmetic coding is similar to Huffman coding; they both achieve their compression by
reducing the average number of bits required to represent a symbol.
Given:
An alphabet with symbols S0, S1, ... Sn, where each symbol has a probability of occurrence of p0,
p1, ... pn such that $\sum_i p_i = 1$.
From the fundamental theorem of information theory, it can be shown that the optimal coding for
Si requires $-\log_2(p_i)$ bits, which gives an average of $-\sum_i p_i \log_2(p_i)$ bits per symbol.
More often than not, the optimal number of bits is fractional. Unlike Huffman coding, arithmetic
coding provides the ability to represent symbols with fractional bits.
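To see how often the optimal length is fractional, the quantity -log2(p) can be evaluated for a few probabilities. The sketch below is only an illustration: the function name optimal_bits and the list of probabilities are chosen for this example.

```python
import math


def optimal_bits(p):
    """Ideal code length in bits for a symbol with probability p."""
    return -math.log2(p)


# Illustrative probabilities; they sum to 1.
probabilities = [0.30, 0.15, 0.25, 0.10, 0.20]

for p in probabilities:
    print(f"p = {p:.2f}: {optimal_bits(p):.3f} bits")

# Probability-weighted average, i.e. the entropy of the source.
average = sum(p * optimal_bits(p) for p in probabilities)
print(f"average: {average:.3f} bits per symbol")
```

Only the symbol with probability 0.25 has a whole-number ideal length (2 bits). A Huffman code has to round every length to a whole number of bits, whereas arithmetic coding does not.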
Since $\sum_i p_i = 1$, we can represent each probability, pi, as a unique non-overlapping range of values
between 0 and 1. There's no magic in this; we're just creating ranges on a probability line.
For example, suppose we have an alphabet 'a', 'b', 'c', 'd', and 'e' with probabilities of occurrence
of 30%, 15%, 25%, 10%, and 20%. We can choose the following range assignments to each
symbol based on its probability:
TABLE 1. Sample Symbol Ranges
Symbol  Probability  Range
a       30%          [0.00, 0.30)
b       15%          [0.30, 0.45)
c       25%          [0.45, 0.70)
d       10%          [0.70, 0.80)
e       20%          [0.80, 1.00)
Here square brackets '[' and ']' mean the adjacent number is included and parentheses '(' and ')'
mean the adjacent number is excluded.
Range assignments like the ones in this table can then be used for encoding and decoding strings
of symbols in the alphabet. Algorithms using ranges for coding are often referred to as range
coders.
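Range assignments of this kind are simply cumulative sums of the probabilities. A minimal Python sketch, assuming the five-symbol alphabet above and using a hypothetical helper named build_ranges, could look like this:

```python
from fractions import Fraction


def build_ranges(probabilities):
    """Map each symbol to a half-open range [low, high) on the probability line."""
    ranges = {}
    low = Fraction(0)
    for symbol, p in probabilities.items():
        ranges[symbol] = (low, low + p)  # the range width equals the probability
        low += p                         # the next range starts where this one ends
    return ranges


# Probabilities of the sample alphabet, written as exact fractions.
PROBABILITIES = {
    "a": Fraction(30, 100),
    "b": Fraction(15, 100),
    "c": Fraction(25, 100),
    "d": Fraction(10, 100),
    "e": Fraction(20, 100),
}

for symbol, (low, high) in build_ranges(PROBABILITIES).items():
    print(f"{symbol}: [{float(low):.2f}, {float(high):.2f})")
```

The order in which the symbols are laid out is arbitrary, but the encoder and the decoder must agree on it.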
Encoding Strings
By assigning each symbol its own unique probability range, it's possible to encode a single
symbol by its range. Using this approach, we could encode a string as a series of probability
ranges, but that doesn't compress anything. Instead, additional symbols may be encoded by
restricting the current probability range by the range of the new symbol being encoded. The
pseudocode below illustrates how additional symbols may be added to an encoded string by restricting
the string's range bounds.
lower bound = 0
upper bound = 1
while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range × upper bound of new symbol)
    lower bound = lower bound + (current range × lower bound of new symbol)
end while
Any value between the computed lower and upper probability bounds now encodes the input
string.
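The encoding loop can be written out in Python roughly as follows. This is a sketch under stated assumptions: the names RANGES and encode are hypothetical, the ranges are those of the sample table (Table 1), and exact fractions stand in for the fixed-precision integer arithmetic a practical range coder would use.

```python
from fractions import Fraction as F

# Symbol ranges from Table 1: symbol -> (lower bound, upper bound)
RANGES = {
    "a": (F(0), F(30, 100)),
    "b": (F(30, 100), F(45, 100)),
    "c": (F(45, 100), F(70, 100)),
    "d": (F(70, 100), F(80, 100)),
    "e": (F(80, 100), F(1)),
}


def encode(message, ranges=RANGES):
    """Return the (lower, upper) bounds of the range that encodes message."""
    lower, upper = F(0), F(1)
    for symbol in message:
        current_range = upper - lower
        sym_lower, sym_upper = ranges[symbol]
        # Both new bounds are computed from the old lower bound.
        upper = lower + current_range * sym_upper
        lower = lower + current_range * sym_lower
    return lower, upper


lower, upper = encode("ace")
print(float(lower), float(upper))
```

Note that the upper bound is updated before the lower bound, as in the pseudocode, because both updates depend on the old lower bound.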
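Example:
Encode the string "ace" using the probability ranges from Table 1.
Start with lower and upper probability bounds of 0 and 1.
Encode 'a'
current range = 1 - 0 = 1
upper bound = 0 + (1 × 0.3) = 0.3
lower bound = 0 + (1 × 0.0) = 0.0
Encode 'c'
current range = 0.3 - 0.0 = 0.3
upper bound = 0.0 + (0.3 × 0.70) = 0.210
lower bound = 0.0 + (0.3 × 0.45) = 0.135
Encode 'e'
current range = 0.210 - 0.135 = 0.075
upper bound = 0.135 + (0.075 × 1.00) = 0.210
lower bound = 0.135 + (0.075 × 0.80) = 0.195
The string "ace" may be encoded by any value within the range [0.195, 0.210).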
Decoding Strings
The decoding process must start with an encoded value representing a string. By definition, the
encoded value lies within the lower and upper probability range bounds of the string it
represents. Since the encoding process keeps restricting ranges (without shifting), the initial
value also falls within the range of the first encoded symbol. Successive encoded symbols may
be identified by removing the scaling applied by the known symbol. To do this, subtract the
lower probability range bound of the known symbol, and divide by the size of the symbol's
range.
Based on the discussion above, decoding a value may be performed following the steps in the
pseudo code below:
encoded value = encoded input
while string is not fully decoded
    identify the symbol containing encoded value within its range
    // remove effects of symbol from encoded value
    current range = upper bound of new symbol - lower bound of new symbol
    encoded value = (encoded value - lower bound of new symbol) ÷ current range
end while
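A Python sketch of this decoding loop is shown below. The names RANGES and decode are hypothetical, the ranges are those of Table 1, and the number of symbols to decode is passed in explicitly, which is an assumption: the pseudocode leaves open how the decoder knows that the string is fully decoded.

```python
from fractions import Fraction as F

# Symbol ranges from Table 1: symbol -> (lower bound, upper bound)
RANGES = {
    "a": (F(0), F(30, 100)),
    "b": (F(30, 100), F(45, 100)),
    "c": (F(45, 100), F(70, 100)),
    "d": (F(70, 100), F(80, 100)),
    "e": (F(80, 100), F(1)),
}


def decode(value, length, ranges=RANGES):
    """Decode length symbols from an encoded value in [0, 1)."""
    symbols = []
    for _ in range(length):
        for symbol, (lower, upper) in ranges.items():
            if lower <= value < upper:
                symbols.append(symbol)
                # Remove the effect of the symbol: shift, then rescale.
                value = (value - lower) / (upper - lower)
                break
        else:
            raise ValueError("encoded value must lie in [0, 1)")
    return "".join(symbols)


print(decode(F(20, 100), 3))
```

Real coders usually signal the end of the string with a dedicated end-of-message symbol or a transmitted length instead of a fixed argument.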
Example:
Using the probability ranges from Table 1 decode the three character string encoded as 0.20.
Decode first symbol
0.20 is within [0.00, 0.30)