
DATA COMPRESSION

The word data is generally used to mean information in digital form on which computer
programs operate, and compression means a process of removing redundancy from that data. By
'compressing data' we actually mean deriving techniques or, more specifically, designing
efficient algorithms to:
represent data in a less redundant fashion
remove the redundancy in data
implement compression algorithms, including both compression and decompression.
Data compression means encoding the information in a file in such a way that it takes up less space.
Compression is used just about everywhere: all the images you get on the web are compressed,
typically in the JPEG or GIF formats; most modems use compression; HDTV is compressed
using MPEG-2; and several file systems automatically compress files when stored, while the rest
of us do it by hand. The task of compression consists of two components: an encoding algorithm
that takes a message and generates a compressed representation (hopefully with fewer bits),
and a decoding algorithm that reconstructs the original message or some approximation of it
from the compressed representation.
Compression denotes compact representation of data.
Typical examples of the kinds of data we want to compress are:
text
source-code
arbitrary files
images
video
audio data
speech

Why do we need compression?


Compression technology is employed to use storage space efficiently and to save transmission
capacity and transmission time. Basically, it is all about saving resources and money.
Despite the overwhelming advances in storage media and transmission networks, it is perhaps
surprising that compression technology is still required. One important reason is that the
resolution and amount of digital data have also increased (e.g. HD-TV resolution, ever-increasing
sensor sizes in consumer cameras), and that there are still application areas where
resources are limited, e.g. wireless networks. Apart from the aim of simply reducing the amount
of data, standards like MPEG-4, MPEG-7, and MPEG-21 offer additional functionalities.

Why is it possible to compress data?

Compression-enabling properties are:


Statistical redundancy: in non-compressed data, all symbols are represented
with the same number of bits regardless of their relative frequency (fixed
length representation).

Correlation: adjacent data samples tend to be equal or similar (e.g. think of images or
video data). There are different types of correlation:

Spatial correlation
Spectral correlation
Temporal correlation
In addition, in many data types there is a significant amount of irrelevancy since the human brain
is not able to process and/or perceive the entire amount of data. As a consequence, such data can
be omitted without degrading perception. Furthermore, some data contain more abstract
properties which are independent of time, location, and resolution and can be described very
efficiently (e.g. fractal properties).
Compression techniques are broadly classified into two categories:

Lossless Compression
A compression approach is lossless only if it is possible to exactly reconstruct the original data
from the compressed version. There is no loss of any information during the compression
process.
For example, in Figure below, the input string AABBBA is reconstructed after the execution of
the compression algorithm followed by the decompression algorithm.
Lossless compression is called reversible compression since the original data may be recovered
perfectly by decompression.

Lossless compression techniques are used when the original data of a source are so important
that we cannot afford to lose any details. Examples of such source data are medical images, text
and images preserved for legal reason, some computer executable files, etc.
In lossless compression (as the name suggests) data are reconstructed after compression without
errors, i.e. no information is lost. Typical application domains where you do not want to lose
information are the compression of text, files and faxes. In the case of image data, no information
loss can be tolerated for medical imaging or for the compression of maps in the context of a land
registry. A further reason to stick to lossless coding schemes instead of lossy ones is their lower
computational demand.
Lossless Compression typically is a process with three stages:

The model: the data to be compressed is analyzed with respect to its structure and the
relative frequency of the occurring symbols.
The encoder: produces a compressed bitstream / file using the information provided by
the model.
The adaptor: uses information extracted from the data (usually during encoding) in order
to adapt the model (more or less) continuously to the data.
The most fundamental idea in lossless compression is to employ codewords which are shorter (in
terms of their binary representation) than their corresponding symbols when those symbols
occur frequently. Conversely, codewords are longer than the corresponding symbols when the
latter occur rarely.

Lossy data compression


A compression method is lossy if it is not possible to reconstruct the original exactly from the
compressed version. There are some insignificant details that may get lost during the process of
compression. The word insignificant here implies certain requirements to the quality of the
reconstructed data. Figure below shows an example where a long decimal number becomes a
shorter approximation after the compression-decompression process.

Lossy compression is called irreversible compression since it is impossible to recover the


original data exactly by decompression. Approximate reconstruction may be desirable since it
may lead to more effective compression. However, it often requires a good balance between the
visual quality and the computation complexity.
Data such as multimedia images, video and audio are more easily compressed by lossy
compression techniques because of the way that human visual and hearing systems work.

A lossy data compression method is one where compressing data and then decompressing it
retrieves data that may well be different from the original, but is "close enough" to be useful in
some way. Lossy data compression is used frequently on the Internet and especially in streaming
media and telephony applications. These methods are typically referred to as codecs in this
context. Most lossy data compression formats suffer from generation loss: repeatedly
compressing and decompressing the file will cause it to progressively lose quality. This is in
contrast with lossless data compression.

Lossless vs. lossy compression


The advantage of lossy methods over lossless methods is that in some cases a lossy method can
produce a much smaller compressed file than any known lossless method, while still meeting the
requirements of the application.
Lossy methods are most often used for compressing sound, images or video. The compression
ratio (that is, the size of the compressed file compared to that of the uncompressed file) of lossy
video codecs is nearly always far superior to that of the audio and still-image equivalents.
Audio can typically be compressed at 10:1 with no noticeable loss of quality, and video can be
compressed immensely with little visible quality loss, e.g. 300:1. Lossily compressed still images
are often compressed to 1/10th of their original size, as with audio, but the quality loss is more
noticeable, especially on closer inspection.
When a user acquires a lossily compressed file (for example, to reduce download time), the
retrieved file can be quite different from the original at the bit level while being indistinguishable
to the human ear or eye for most practical purposes. Many methods focus on the idiosyncrasies
of human anatomy, taking into account, for example, that the human eye can see only certain
frequencies of light. The psycho-acoustic model describes how sound can be highly compressed
without degrading its perceived quality. Flaws caused by lossy compression that
are noticeable to the human eye or ear are known as compression artifacts.
Lossless compression algorithms usually exploit statistical redundancy in such a way as to
represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is
possible because most real-world data has statistical redundancy. For example, in English text,
the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be
followed by the letter 'z' is very small.
Another kind of compression, called lossy data compression, is possible if some loss of fidelity is
acceptable. For example, a person viewing a picture or television video scene might not notice if
some of its finest details are removed or not represented perfectly. Similarly, two clips of audio
may be perceived as the same to a listener even though one is missing details found in the other.
Lossy data compression algorithms introduce relatively minor differences and represent the
picture, video, or audio using fewer bits.
Lossless compression schemes are reversible so that the original data can be reconstructed, while
lossy schemes accept some loss of data in order to achieve higher compression. However,

lossless data compression algorithms will always fail to compress some files; indeed, any
compression algorithm will necessarily fail to compress any data containing no discernible
patterns. Attempts to compress data that has been compressed already will therefore usually
result in an expansion, as will attempts to compress encrypted data.

Measure of Performance
Codeword: A binary string representing either the whole coded data or one coded data
symbol
Coded Bitstream: the binary string representing the whole coded data.
Lossless Compression: 100% accurate reconstruction of the original data
Lossy Compression: The reconstruction involves errors which may or may not be tolerable
Bit Rate: Average number of bits per original data element after compression

Variable length codes


Variable length codes are desirable for data compression because overall savings
may be achieved by assigning short codewords to frequently occurring symbols and
long codewords to rarely occurring ones.
For example, consider a variable length code (0, 100, 101, 110, 111) with codeword lengths
(1, 3, 3, 3, 3) for the alphabet (A, B, C, D, E), and a source string
BAAAAAAAC in which the symbols occur with frequencies (7, 1, 1, 0, 0). The average number of
bits required is

(1*7 + 3*1 + 3*1) / 9 = 13/9 ≈ 1.44 bits/symbol

This is a saving of roughly half the number of bits compared to the 3 bits/symbol of a
3-bit fixed length code.
The shorter the codewords, the shorter the total length of a source file. Hence the
code would be a better one from the compression point of view.
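As a quick sanity check, the average code length for any code/frequency assignment can be computed directly. The short Python sketch below (illustrative only, using the symbols and lengths from the example above) does exactly that.

# Average code length for the example code: (A,B,C,D,E) -> lengths (1,3,3,3,3)
lengths = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 3}
source = "BAAAAAAAC"

total_bits = sum(lengths[s] for s in source)
avg_bits = total_bits / len(source)
print(total_bits, round(avg_bits, 2))   # 13 bits total, ~1.44 bits/symbol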

Unique decodability
Variable length codes are useful for data compression. However, a variable length
code would be useless if the codewords could not be identified in a unique way from
the encoded message.
Example Consider the variable length code (0, 10, 010, 101) for alphabet (A, B, C,
D). A segment of encoded message such as '0100101010' can be decoded in more
than one way. For example, '0100101010' can be interpreted in at least two ways, '0
10 010 101 0' as ABCDA or '010 0 101 010' as CADC.
A code is uniquely decodable if there is only one possible way to decode encoded
messages. The code (0, 10, 010, 101) in Example above is not uniquely decodable
and therefore cannot be used for data compression.

Prefix codes and binary trees


Codes with the self-punctuating property do exist. One such type, the so-called prefix code,
can be identified by checking its prefix-free property, or prefix property for
short.
A prefix is the first few consecutive bits of a codeword. When two codewords are of
different lengths, it is possible that the shorter codeword is identical to the first few
bits of the longer codeword. In this case, the shorter codeword is said to be a prefix
of the longer one.
Example 2.3 Consider two binary codewords of different length: C1= 010 (3 bits)
and C2 = 01011 (5 bits).
The shorter codeword C1 is the prefix of the longer code C2 as C2 =010 11.
Codeword C2 can be obtained by appending two more bits 11 to C1.
The prefix property of a binary code is the fact that no codeword is a prefix of
another.
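Checking the prefix property programmatically only requires comparing every pair of codewords. The small Python sketch below is illustrative; the function name is_prefix_code is my own, not from the text.

def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_code(["0", "100", "101", "110", "111"]))  # True
print(is_prefix_code(["0", "10", "010", "101"]))          # False (0 is a prefix of 010)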

Prefix codes and unique decodability


Prefix codes are a subset of the uniquely decodable codes: if a code is a prefix code,
then it is uniquely decodable.

However, if a code is not a prefix code, we cannot conclude that the code is not
uniquely decodable. This is because other types of code may also be uniquely
decodable.
Example: Consider the code (0, 01, 011, 0111) for (A, B, C, D). This is not a prefix code,
as the first codeword 0 is a prefix of all the others.
However, given an encoded message 01011010111, there is no ambiguity and only
one way to decode it: 01 011 01 0111, i.e. BCBD. Each 0 acts as a self-punctuating
marker in this example: we only need to watch for the 0 that begins each codeword, and
for the 1 immediately preceding it, which is the last bit of the previous codeword.
Some codes are uniquely decodable but require looking ahead during the decoding
process. This makes them not as efficient as prefix codes.

Static Huffman coding


Huffman coding is a successful compression method used originally for text
compression. In any text, some characters occur far more frequently than others.
For example, in English text, the letters E, A, O, T are normally used much more
frequently than J, Q, X.
Huffman's idea is, instead of using a fixed-length code such as 8 bit extended ASCII
or EBCDIC for each symbol, to represent a frequently occurring character in a source
with a shorter codeword and to represent a less frequently occurring one with a
longer codeword. Hence the total number of bits of this representation is
significantly reduced for a source of symbols with different frequencies. The number
of bits required is reduced for each symbol on average.

Static Huffman coding assigns variable length codes to symbols based on
their frequency of occurrence in the given message. Low frequency symbols
are encoded using many bits, and high frequency symbols are encoded using
fewer bits.

The message to be transmitted is first analyzed to find the relative
frequencies of its constituent characters.

The coding process generates a binary tree, the Huffman code tree, with
branches labeled with bits (0 and 1).

The Huffman tree (or the character-codeword pairs) must be sent with the
compressed information to enable the receiver to decode the message.

Static Huffman Coding Algorithm

Find the frequency of each character in the file to be compressed;
For each distinct character, create a one-node binary tree containing the character
and its frequency as its priority;
Insert the one-node binary trees into a priority queue in increasing order of frequency;
while (there is more than one tree in the priority queue) {
    dequeue two trees t1 and t2;
    create a tree t that contains t1 as its left subtree and t2 as its right subtree;  // 1
    priority(t) = priority(t1) + priority(t2);
    insert t in its proper location in the priority queue;                             // 2
}
Assign 0 and 1 weights to the edges of the resulting tree, such that the left and
right edges of each node do not have the same weight;                                  // 3

Note: The Huffman code tree for a particular set of characters is not unique.
(Steps 1, 2, and 3 may be done differently.)
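For concreteness, here is a small Python sketch of the same procedure, using heapq as the priority queue. It is an illustrative implementation of the algorithm above, not code assumed by these notes.

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code {char: bitstring} for the characters in text."""
    freq = Counter(text)
    # Each heap entry: (priority, tie-breaker, tree); a tree is a char or a (left, right) pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)          # dequeue the two lowest-priority trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):              # internal node
            walk(tree[0], prefix + "0")          # left edge labeled 0
            walk(tree[1], prefix + "1")          # right edge labeled 1
        else:                                    # leaf: a character
            codes[tree] = prefix or "0"
    _, _, root = heap[0]
    walk(root)
    return codes

print(huffman_codes("this is an example of a huffman tree"))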
Example: Information to be transmitted over the internet contains the following
characters with their associated frequencies:

Character:  a   e   l   n   o   s   t
Frequency:  45  65  13  45  18  22  53


Use the Huffman technique to answer the following questions:
Build the Huffman code tree for the message.
Use the Huffman tree to find the codeword for each character.
If the data consists of only these characters, what is the total number
of bits to be transmitted? What is the compression ratio?
Verify that your computed Huffman codewords satisfy the prefix property.
Solution:
Sort the list of characters in increasing order of frequency.

Now create the Huffman tree.

Final Huffman Tree

Now assign codes to the edges of the tree: label each left edge 0 and each right edge 1.

The sequence of zeros and ones along the arcs on the path from the root to each
leaf node gives the desired codes:

Character:         a    e   l     n    o     s    t
Huffman codeword:  110  10  0110  111  0111  010  00

If we assume the message consists of only the characters a, e, l, n, o, s, t, then the
number of bits for the compressed message will be 696:

45*3 + 65*2 + 13*4 + 45*3 + 18*4 + 22*3 + 53*2 = 696

If the message is sent uncompressed with an 8-bit ASCII representation for the
characters, we need 261*8 = 2088 bits.
Assume that the number of character-codeword pairs and the pairs themselves are included
at the beginning of the binary file containing the compressed message, in the
following format:

Number of bits for the transmitted file
= bits(7) + bits(characters) + bits(codewords) + bits(compressed message)
= 3 + (7*8) + 21 + 696 = 776

Compression ratio = bits for ASCII representation / number of bits transmitted
= 2088 / 776 ≈ 2.69

Thus, the size of the transmitted file is 100 / 2.69 ≈ 37% of the original ASCII file.
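These bit counts are easy to verify mechanically; the snippet below (plain Python, using the frequencies and codeword lengths from the example above) recomputes them.

# Frequencies and Huffman codeword lengths from the example above.
freq    = {"a": 45, "e": 65, "l": 13, "n": 45, "o": 18, "s": 22, "t": 53}
lengths = {"a": 3,  "e": 2,  "l": 4,  "n": 3,  "o": 4,  "s": 3,  "t": 2}

compressed   = sum(freq[c] * lengths[c] for c in freq)           # 696
uncompressed = sum(freq.values()) * 8                            # 261 * 8 = 2088
transmitted  = 3 + 7 * 8 + sum(lengths.values()) + compressed    # 776
print(compressed, uncompressed, uncompressed / transmitted)      # 696 2088 ~2.69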

The Prefix Property


Data encoded using Huffman coding is uniquely decodable. This is because
Huffman codes satisfy an important property called the prefix property: in a given
set of Huffman codewords, no codeword is a prefix of another Huffman codeword.
For example, in a given set of Huffman codewords, 10 and 101 cannot
simultaneously be valid Huffman codewords because the first is a prefix of the
second.
We can see by inspection that the codewords we generated in the previous example
are valid Huffman codewords.
To see why the prefix property is essential, consider the codewords given below, in
which e is encoded as 110, which is a prefix of the codeword for f:

Character:  a  b    c    d    e    f
Codeword:   0  101  100  111  110  1100

The decoding of 11000100110 is ambiguous:

11000100110 = 1100 0 100 110   => face
11000100110 = 110 0 0 100 110  => eaace

Optimal Huffman codes


Huffman codes are optimal when the probabilities of the source symbols are all
negative powers of two. Examples of negative powers of two are 1/2, 1/4, 1/8, etc.
The conclusion can be drawn from the following justification.
Suppose that the lengths of the Huffman codewords are L = (l1, l2, ..., ln) for a source
P = (p1, p2, ..., pn), where n is the size of the alphabet.
Using a variable length code that assigns lj bits to the symbol with probability pj, the
average length of the codewords is (in bits):

average length = p1*l1 + p2*l2 + ... + pn*ln

A code is optimal if the average length of the codewords equals the entropy of the
source,

H = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))

Comparing the two expressions term by term, the average length can equal the entropy
only if lj = -log2(pj) for all j = 1, 2, ..., n. Since the length lj has to be an integer
(in bits) for Huffman codes, -log2(pj) has to be an integer, too. Of course, -log2(pj)
cannot be an integer unless pj is a negative power of 2, for all j = 1, 2, ..., n.
In other words, Huffman codes can only reach the entropy when all probabilities are
negative powers of 2, because each lj has to be an integer number of bits.
For example, for a source P = (1/2, 1/4, 1/8, 1/8), the Huffman code for the source can
be optimal.
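To illustrate this condition, the short Python check below compares the entropy of P = (1/2, 1/4, 1/8, 1/8) with the average length of a corresponding Huffman code; the codeword lengths (1, 2, 3, 3) used here correspond to one of several equivalent Huffman codes, e.g. 0, 10, 110, 111.

import math

probs   = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]          # e.g. codewords 0, 10, 110, 111

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(entropy, avg_len)         # both 1.75 bits/symbol: the code is optimal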
Optimality of Huffman coding
We show that the prefix code generated by the Huffman coding algorithm is optimal
in the sense of minimizing the expected code length among all binary prefix codes
for the input alphabet.

The leaf merges operation


A key ingredient in the proof involves constructing a new tree from an existing
binary coding tree T by eliminating two sibling leaves a1 and a2 and replacing them by
their parent node, labeled with the sum of the probabilities of a1 and a2. We
denote the new tree obtained in this way merge(T, a1, a2). Likewise, if A is the
alphabet of symbols for T, that is, the set of leaves of T, then we define a new
alphabet A', denoted merge(A, a1, a2), as the alphabet consisting of all symbols of A
other than a1 and a2, together with a new symbol a that represents a1 and a2
combined.
Two important observations related to these leaf merging operations follow.
1. Let T be the Huffman tree for an alphabet A, and let a1 and a2 be the two
symbols of lowest probability in A. Let T' be the Huffman tree for the reduced
alphabet A' = merge(A, a1, a2). Then T' = merge(T, a1, a2). In other words,
the Huffman tree for the merged alphabet is the merge of the Huffman tree
for the original alphabet. This is true simply by the definition of the Huffman
procedure.
2. The expected code length for T exceeds that of merge(T, a1, a2) by precisely
the sum p of the probabilities of the leaves a1 and a2. This is because in the
sum that defines the expected code length for the merged tree, the term dp,
where d is the depth of the parent a of these leaves, is replaced by two terms
(corresponding to the leaves) which sum to (d + 1)p.

Proof of optimality of Huffman coding


With the above comments in mind, we now give the formal optimality proof. We
show by induction in the size, n, of the alphabet, A, that the Huffman coding

algorithm returns a binary prefix code of lowest expected code length among all
prefix codes for the input alphabet.
(Basis) If n = 2, the Huffman algorithm finds the optimal prefix code, which
assigns 0 to one symbol of the alphabet and 1 to the other.
(Induction Hypothesis) For some n ≥ 2, Huffman coding returns an optimal prefix
code for any input alphabet containing n symbols.
(Inductive Step) Assume that the Induction Hypothesis (IH) holds for some value
of n. Let A be an input alphabet containing n + 1 symbols. We show that Huffman
coding returns an optimal prefix code for A.
Let a1 and a2 be the two symbols of smallest probability in A. Consider the merged
alphabet A' = merge(A, a1, a2) as defined above. By the IH, the Huffman tree T' for
this merged alphabet A' is optimal. We also know that T' is in fact the same as the
tree merge(T, a1, a2) that results from the Huffman tree T for the original alphabet
A by replacing a1 and a2 with their parent node. Furthermore, the expected code
length L for T exceeds the expected code length L' for T' by exactly the sum p of
the probabilities of a1 and a2.
We claim that no binary coding tree for A has an expected code length less than L =
L' + p.
Let T2 be any tree of lowest expected code length for A. Without loss of generality,
a1 and a2 are leaves at the deepest level of T2, since otherwise one could swap
them with shallower leaves and reduce the code length even further. Furthermore,
we may assume that a1 and a2 are siblings in T2. Therefore, we obtain from T2 a
coding tree T2' for the merged alphabet A' through the merge procedure described
above, replacing a1 and a2 by their parent labeled with the sum of their probabilities:
T2' = merge(T2, a1, a2). By the observation above, the expected code lengths L2
and L2' of T2 and T2' respectively satisfy L2 = L2' + p. But by the IH, T' is optimal
for the alphabet A'. Therefore L2' ≥ L', and it follows that
L2 = L2' + p ≥ L' + p = L,
which shows that T is optimal as claimed.

Minimum Variance Huffman Codes


The Huffman coding algorithm has some flexibility when two equal frequencies are found. The
choice made in such situations will change the final code including possibly the code length of
each message. Since all Huffman codes are optimal, however, it cannot change the average
length.

For example, consider a source with symbols a, b, c, d, e of probabilities 0.2, 0.4, 0.2, 0.1, 0.1,
and two different Huffman codes for it, code 1 and code 2.

Both codings produce an average of 2.2 bits per symbol, even though the codeword lengths are
quite different in the two codes. Given this choice, is there any reason to pick one code over the
other? For some applications it can be helpful to reduce the variance of the code length. The
variance is defined as

variance = sum over all symbols s of p(s) * (l(s) - average length)^2,

where l(s) is the length of the codeword for s. With lower variance it can be easier to maintain a
constant character transmission rate, or to reduce the size of buffers. In the above example,
code 1 clearly has a much higher variance than code 2.
It turns out that a simple modification to the Huffman algorithm can be used to generate a code
that has minimum variance: when choosing the two nodes to merge and there is a choice among
nodes of equal weight, always pick the node that was created earliest in the algorithm. Leaf
nodes are assumed to be created before all internal nodes. In the example above, after d and e
are joined, the pair has the same probability (0.2) as c and a, but it was created afterwards,
so we join c and a. Similarly, we select b instead of the ac node to join with de, since b was
created earlier. This gives code 2 above, and the corresponding Huffman tree in the figure below.

Extended Huffman coding


One problem with Huffman codes is that they meet the entropy bound only when
all probabilities are negative powers of 2. What happens if the alphabet is binary, e.g.
S = (a, b)? The only optimal case is when P = (Pa, Pb) with Pa = 1/2 and Pb = 1/2.
Hence, Huffman codes can be bad.
For example, consider a situation where Pa = 0.8 and Pb = 0.2.
Solution: Since Huffman coding needs at least 1 bit per symbol to encode the input,
the Huffman codewords are 1 bit per symbol on average.

However, the entropy of the distribution is

H = -(0.8*log2(0.8) + 0.2*log2(0.2)) ≈ 0.722 bits/symbol

The efficiency of the code is therefore about 0.72/1 = 72%.

This leaves a gap of 1 - 0.72 = 0.28 bits. The performance of the Huffman encoding
algorithm is, therefore, 0.28/1 = 28% worse than optimal in this case.
The idea of extended Huffman coding is to encode a sequence of source symbols
instead of individual symbols. The alphabet size of the source is artificially
increased in order to improve the code efficiency. For example, instead of
assigning a codeword to every individual symbol for a source alphabet, we derive a
codeword for every two symbols.
The following example shows how to achieve this:

Example 4.8 Create a new alphabet S' = (aa, ab, ba, bb) extended from S = (a, b).
Let aa = A, ab = B, ba = C and bb = D. We now have an extended alphabet S' = (A,
B, C, D). Each symbol in the alphabet S' is a combination of two symbols from the
original alphabet S. The size of the alphabet S' increases to 2^2 = 4.
Suppose the symbols 'a' and 'b' occur independently. The probability distribution for S',
the extended alphabet, can be calculated as below:
PA = Pa x Pa = 0.64
PB = Pa x Pb = 0.16
PC = Pb x Pa = 0.16
PD = Pb x Pb = 0.04
We then follow the normal static Huffman encoding algorithm to derive the
Huffman code for S'.
The canonical minimum-variance code for S' is (0, 11, 100, 101) for A, B, C, D
respectively. The average length is 0.64x1 + 0.16x2 + 0.16x3 + 0.04x3 = 1.56 bits for
two original symbols.
The output therefore becomes 1.56/2 = 0.78 bits per original symbol. The efficiency of the
code has increased to 0.72/0.78 ≈ 92%. This is only (0.78 - 0.72)/0.78 ≈ 8%
worse than optimal.
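The arithmetic in this example can be reproduced with a few lines of Python; the code lengths below are those of the (0, 11, 100, 101) code quoted above, and the efficiency figure differs slightly from 92% only because the exact entropy (0.722) is used instead of the rounded 0.72.

import math

p = {"a": 0.8, "b": 0.2}
# Probabilities of the extended (pair) alphabet, assuming independent symbols.
pairs = {x + y: p[x] * p[y] for x in p for y in p}
lengths = {"aa": 1, "ab": 2, "ba": 3, "bb": 3}   # code (0, 11, 100, 101)

avg_per_pair   = sum(pairs[s] * lengths[s] for s in pairs)   # 1.56 bits per pair
avg_per_symbol = avg_per_pair / 2                            # 0.78 bits per symbol
entropy = -sum(q * math.log2(q) for q in p.values())         # ~0.722 bits per symbol
print(avg_per_symbol, entropy / avg_per_symbol)              # ~0.78, efficiency ~0.93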
Dynamic/Adaptive Huffman Coding
For Huffman coding, we need to know the probabilities for the individual symbols
(and we also need to know the joint probabilities for blocks of symbols in extended
Huffman coding). If the probability distributions are not known for a file of
characters to be transmitted, they have to be estimated first and the code itself has
to be included in the transmission of the coded file. In dynamic Huffman coding a
particular probability (frequency of symbols) distribution is assumed at the
transmitter and receiver and hence a Huffman code is available to start with. As
source symbols come in to be coded the relative frequency of the different symbols
is updated at both the transmitter and the receiver, and corresponding to this the
code itself is updated. In this manner the code continuously adapts to the nature of
the source distribution, which may change as time progresses and different files are
being transferred.

Dynamic Huffman coding is the basis of data compression algorithms used in V-series
modems for transmission over the PSTN. In particular the MNP (Microcom
Networking Protocol) Class 5 protocol commonly found in modems such as the
V.32bis modem uses both CRC error detection and dynamic Huffman coding
for data compression.
Adaptive Approach
In the adaptive Huffman coding, an alphabet and frequencies of its symbols are
collected and maintained dynamically according to the source file on each
iteration. The Huffman tree is also updated based on the alphabet and frequencies
dynamically. When the encoder and decoder are at different locations, both
maintain an identical Huffman tree at each step independently. Therefore, there is
no need to transfer the Huffman tree.
During the compression process, the Huffman tree is updated each time after a
symbol is read. The codeword(s) for the symbol is output immediately. For
convenience of discussion, the frequency of each symbol is called the weight of the
symbol to reflect the change of the frequency count at each stage.
The output of the adaptive Huffman encoding consists of Huffman codewords as
well as fixed length codewords. For each input symbol, the output can be a
Huffman codeword based on the Huffman tree in the previous step or a codeword
of a fixed length code such as ASCII. Using a fixed length codeword as the output
is necessary when a new symbol is read for the first time. In this case, the Huffman
tree does not include the symbol yet. It is therefore reasonable to output the
uncompressed version of the symbol. If the source file consists of ASCII, then the
fixed length codeword would simply be the uncompressed version of the symbol.
In the encoding process, for example, the model outputs a codeword of a fixed
length code such as ASCII code, if the input symbol has been seen for the first
time. Otherwise, it outputs a Huffman codeword.
However, a mixture of the fixed length and variable length codewords can cause
problems in the decoding process. The decoder needs to know whether the
codeword should be decoded according to a Huffman tree or by a fixed length

codeword before taking a right approach. A special symbol as a flag, therefore, is


used to signal a switch from one type of codeword to another.
Let the current alphabet be the subset S = (⋄, s1, s2, ..., sn) of some alphabet Σ, where ⋄
denotes a special flag symbol, and let g(si) be any fixed length codeword for si (e.g. its
ASCII code), i = 1, 2, .... To indicate whether the output codeword is a fixed length or a
variable length codeword, this special symbol ⋄ (which does not belong to Σ) is defined
as a flag or shift key and is placed before each fixed length codeword, for communication
between the compressor and decompressor.
The compression algorithm maintains the subset S of symbols of the alphabet Σ that the
system has seen so far, together with the flag ⋄. The Huffman code (i.e. the Huffman
tree) for all the symbols in S is also maintained. Let the weight of ⋄ always be 0,
and let the weight of any other symbol in S be its frequency so far. For convenience,
we represent the weight of each symbol by a number in round brackets. For
example, A(1) means that symbol A has a weight of 1.
Initially, S = {⋄} and the Huffman tree has the single node of symbol ⋄ (see the figure
below, step (0)). During the encoding process, the alphabet S grows by one symbol
each time a new symbol is read. The weight of a new symbol is always 1,
and the weight of an existing symbol in S is increased by 1 when the symbol is
read. The Huffman tree is used to assign codewords to the symbols in S and is
updated after each output.
The following example shows the idea of the adaptive Huffman coding.
Example: Suppose that the source file is the string ABBBAC. The figure below shows
the state at each step of the adaptive Huffman encoding algorithm.

Disadvantages of Huffman algorithms


Adaptive Huffman coding has the advantage of requiring no preprocessing and the
low overhead of using the uncompressed version of the symbols only at their first
occurrence.
The algorithms can be applied to other types of files in addition to text files.
The symbols can be objects or bytes in executable files.
Huffman coding, either static or adaptive, has two disadvantages that remain
unsolved:
Disadvantage 1: It is not optimal unless all probabilities are negative powers of 2.
This means that there is a gap between the average number of bits and the entropy
in most cases.
Recall the particularly bad situation for binary alphabets. Although by grouping
symbols and extending the alphabet, one may come closer to the optimal, the
blocking method requires a larger alphabet to be handled. Sometimes, extended
Huffman coding is not that effective at all.
Disadvantage 2: Despite the availability of some clever methods for counting the
frequency of each symbol reasonably quickly, it can be very slow when rebuilding
the entire tree for each symbol. This is normally the case when the alphabet is big
and the probability distributions change rapidly with each symbol.

RICE CODES
Rice encoding (a special case of Golomb coding) can be applied to reduce the number of bits
required to represent small values. Rice's algorithm is also easy to implement.
Named after Robert Rice, Rice coding is a specialised form of Golomb coding. It's
used to encode strings of numbers with a variable bit length for each number. If
most of the numbers are small, fairly good compression can be achieved. Rice
coding is generally used to encode entropy in an audio/video codec.

Rice coding depends on a parameter k and works the same as Golomb coding with a parameter m
where m = 2^k. To encode a number x:

1. Let q = x / m (round fractions down). Write out q binary ones.
2. Write out a binary zero.
   (Some people prefer to do it the other way round: zeroes followed by a one.)
3. Write out the last k bits of x.

Decoding works the same way, just backwards.

Algorithm Overview
Given a constant M, any symbol S can be represented as a quotient (Q) and remainder (R), where:
S = Q * M + R.
If S is small (relative to M) then Q will also be small. Rice encoding is designed to reduce the
number of bits required to represent symbols where Q is small.
Rather than representing both Q and R as binary values, Rice encoding represents Q as a unary
value and R as a binary value.
For those not familiar with unary notation, a value N may be represented by N 1s followed by a 0.
Example: 3 = 1110 and 5 = 111110.
Note: The following holds for binary values when log2(M) = K, where K is an integer:
1. Q = S >> K (S right-shifted by K bits)
2. R = S & (M - 1) (S bitwise ANDed with M - 1)
3. R can be represented using K bits.

Encoding
Rice coding is fairly straightforward.
Given a bit length K, compute the modulus M using the equation M = 2^K. Then do the following
for each symbol S:
1. Write out S >> K in unary.
2. Write out the lowest K bits of S, i.e. S & (M - 1), in binary.

That's it. I told you it was straightforward.

Example:
Encode the 8-bit value 18 (0b00010010) when K = 4 (M = 16):
1. S >> K = 18 >> 4 = 0b0001 = 1, which is 10 in unary.
2. S & (M - 1) = 18 & (16 - 1) = 0b00010010 & 0b1111 = 0010.

So the encoded value is 100010, saving 2 bits.

Decoding
Decoding isn't any harder than encoding.
As with encoding, given a bit length K, compute the modulus M using the equation M = 2^K.
Then do the following for each encoded symbol S:
1. Determine Q by counting the number of 1s before the first 0.
2. Determine R by reading the next K bits as a binary value.
3. Write out S as Q * M + R.

Example:
Decode the encoded value 100010 when K = 4 (M = 16):
1. Q = 1
2. R = 0b0010 = 2
3. S = Q * M + R = 1 * 16 + 2 = 18
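A compact illustrative implementation of this scheme (unary quotient, then K remainder bits) might look like the following Python sketch; the function names are my own.

def rice_encode(value, k):
    """Encode a non-negative integer as unary(quotient) + k-bit remainder."""
    q = value >> k                     # quotient
    r = value & ((1 << k) - 1)         # remainder, fits in k bits
    return "1" * q + "0" + format(r, f"0{k}b")

def rice_decode(bits, k):
    """Decode a single Rice codeword produced by rice_encode."""
    q = 0
    while bits[q] == "1":              # count leading ones
        q += 1
    r = int(bits[q + 1 : q + 1 + k], 2)
    return (q << k) + r

code = rice_encode(18, 4)
print(code, rice_decode(code, 4))      # 100010 18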
Rice coding only works well when symbols are encoded with small values of Q.
Since Q is unary, encoded symbols can become quite large for even slightly large
Qs. It takes 8 bits just to represent the value 7 in unary. One way to improve the
compression obtained by Rice coding on generic data is to apply a reversible
transformation that reduces the average value of a symbol. The
Burrows-Wheeler Transform (BWT) followed by Move-To-Front (MTF) encoding is such a
transform.

Development:

Jacob Ziv and Abraham Lempel introduced a simple and efficient compression method
in their article "A Universal Algorithm for Sequential Data Compression". The
algorithm is referred to as LZ77 in honour of the authors and the year of publication, 1977.

Fundamentals:
LZ77 is a dictionary-based algorithm that addresses byte sequences in the former contents instead
of the original data. Only one coding scheme exists; all data are coded in the same
form:

Address of already coded contents

Sequence length

First deviating symbol

If no identical byte sequence is available in the former contents, the address 0, the sequence
length 0 and the new symbol will be coded.

Example "abracadabra":
abracadabra
a bracadabra
ab racadabra
abr acadabra
abrac adabra
abracad abra

Addr.
0
0
0
3
2
7

Length
0
0
0
1
1
4

deviating Symbol
'a'
'b'
'r'
'c'
'd'
''

Because each byte sequence is extended by the first symbol deviating from the former contents,
the set of already used symbols will continuously grow. No additional coding scheme is
necessary. This allows an easy implementation with minimal requirements on the encoder and
decoder.

Restrictions:
To keep the runtime and buffering capacity in an acceptable range, the addressing must be limited to
a certain maximum. Contents beyond this range are not considered for coding and do not
increase the size of the addressing pointer.

Compression Efficiency:
The achievable compression rate depends only on repeating sequences. Other types of
redundancy, such as an unequal probability distribution of the set of symbols, cannot be reduced.
For that reason the compression of a pure LZ77 implementation is relatively low.

A significantly better compression rate can be obtained by combining LZ77 with an additional
entropy coding algorithm, for example Huffman or Shannon-Fano coding. The widespread Deflate
compression method (used e.g. by GZIP or ZIP) uses Huffman codes, for instance.

Of these, LZ77 is probably the most straightforward. It tries to replace recurring patterns in the
data with a short code. The code tells the decompressor how many symbols to copy and from
where in the output to copy them. To compress the data, LZ77 maintains a history buffer which
contains the data that has been processed and tries to match the next part of the message to it. If
there is no match, the next symbol is output as-is. Otherwise an (offset,length) -pair is output.
Output    History            Lookahead
S                            SIDVICIIISIDIDVI
I         S                  IDVICIIISIDIDVI
D         SI                 DVICIIISIDIDVI
V         SID                VICIIISIDIDVI
I         SIDV               ICIIISIDIDVI
C         SIDVI              CIIISIDIDVI
I         SIDVIC             IIISIDIDVI
I         SIDVICI            IISIDIDVI
I         SIDVICII           ISIDIDVI
(9, 3)    SIDVICIII          SIDIDVI      match "SID": length 3, offset 9
(2, 2)    SIDVICIIISID       IDVI         match "ID":  length 2, offset 2
(11, 2)   SIDVICIIISIDID     VI           match "VI":  length 2, offset 11

At each stage the string in the lookahead buffer is searched for in the history buffer. The longest
match is used, and the distance between the match and the current position is output together with
the match length. The processed data is then moved to the history buffer. Note that the history buffer
contains data that has already been output; on the decompression side it corresponds to the data
that has already been decompressed. The message becomes:
S I D V I C I I I (9,3) (2,2) (11,2)

The following describes what the decompressor does with this data.
Input            History after processing
S                S
I                SI
D                SID
V                SIDV
I                SIDVI
C                SIDVIC
I                SIDVICI
I                SIDVICII
I                SIDVICIII
(9,3)  -> SID    SIDVICIIISID
(2,2)  -> ID     SIDVICIIISIDID
(11,2) -> VI     SIDVICIIISIDIDVI

In the decompressor the history buffer contains the data that has already been decompressed. If
we get a literal symbol code, it is added as-is. If we get an (offset,length) pair, the offset tells us
from where to copy and the length tells us how many symbols to copy to the current output
position. For example (9,3) tells us to go back 9 locations and copy 3 symbols to the current
output position. The great thing is that we don't need to transfer or maintain any other data
structure than the data itself.
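To make the scheme concrete, here is a small illustrative LZ77-style compressor and decompressor in Python. It uses greedy longest matching over the whole processed prefix, no window or lookahead limit, and emits tokens as Python values rather than packed bits, so its token boundaries may differ from the hand-worked example above; it is a sketch of the idea, not the exact format described in the text.

def lz77_compress(data, min_match=2):
    """Return a list of tokens: literal characters or (offset, length) pairs."""
    tokens, pos = [], 0
    while pos < len(data):
        best_len, best_off = 0, 0
        # Search the already-processed prefix for the longest match.
        for start in range(pos):
            length = 0
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        if best_len >= min_match:
            tokens.append((best_off, best_len))
            pos += best_len
        else:
            tokens.append(data[pos])           # no suitable match: emit a literal
            pos += 1
    return tokens

def lz77_decompress(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            offset, length = t
            for _ in range(length):
                out.append(out[-offset])       # copy from `offset` positions back
        else:
            out.append(t)
    return "".join(out)

msg = "SIDVICIIISIDIDVI"
tokens = lz77_compress(msg)
print(tokens)                                  # literals and (offset, length) pairs
print(lz77_decompress(tokens) == msg)          # True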
Lempel-Ziv 1977
In 1977 Ziv and Lempel proposed a lossless compression method which replaces
phrases in the data stream by a reference to a previous occurrence of the phrase.
As long as it takes fewer bits to represent the reference and the phrase length than
the phrase itself, we get compression. This is a bit like the way BASIC substitutes tokens
for keywords.

LZ77-type compressors use a history buffer, which contains a fixed amount of the symbols
output/seen so far. The compressor reads symbols from the input into a lookahead buffer and tries
to find as long a match as possible in the history buffer. The length of the matching string and its
location in the buffer (offset from the current position) are written to the output. If there is no
suitable match, the next input symbol is sent as a literal symbol.

Of course there must be a way to distinguish literal bytes from compressed data in the output. There
are a lot of different ways to accomplish this, but a single bit to select between a literal and
compressed data is the easiest.
The basic scheme is a variable-to-block code: a variable-length piece of the message is
represented by a constant number of bits, the match length and the match offset. Because the data
in the history buffer is known to both the compressor and the decompressor, it can be used in the
compression. The decompressor simply copies part of the already decompressed data, or a literal
byte, to the current output position.
Variants of LZ77 apply additional compression to the output of the compressor, including a
simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding
(ZIP 1.x), all of which result in a certain degree of improvement over the basic scheme. This is
because the output values from the first stage are not evenly distributed, i.e. their probabilities
are not equal and statistical compression can do its part.

Lempel-Ziv-78 (LZ78)
One year after publishing LZ77, Jacob Ziv and Abraham Lempel introduced another
compression method ("Compression of Individual Sequences via Variable-Rate Coding").
Accordingly, this procedure is called LZ78.

Fundamental algorithm:

LZ78 is based on a dictionary that will be created dynamically at runtime. Both the encoding and
the decoding process use the same rules to ensure that an identical dictionary is available. This
dictionary contains any sequence already used to build the former contents. The compressed data
have the general form:

Index addressing an entry of the dictionary

First deviating symbol

In contrast to LZ77 no combination of address and sequence length is used. Instead only the
index to the dictionary is stored. The mechanism to add the first deviating symbol remains from
LZ77.

Beispiel "abracadabra":

abracadabra
a bracadabra
ab racadabra
abr acadabra
abrac adabra
abracad abra
abracadab ra

index
0
0
0
1
1
1
3

deviating
symbol
'a'
'b'
'r'
'c'
'd'
'b'
'a'

new entry
dictionary
1 "a"
2 "b"
3 "r"
4 "ac"
5 "ad"
6 "ab"
7 "ra"

An LZ78 dictionary grows slowly. For a relevant compression gain, a larger amount of data must
be processed. Additionally, the compression depends mainly on the size of the dictionary,
but a larger dictionary requires more effort for addressing and administration at runtime.

In practice the dictionary would be implemented as a tree to minimize the search effort.
Starting with the current symbol, the algorithm evaluates for every succeeding symbol whether it
is available in the tree. If a leaf node is reached, the corresponding index is written to the
compressed data. The decoder can be realized with a simple table, because the decoder does
not need the search function.

The size of the dictionary grows during the coding process, so the size of the index used for
addressing the table would increase continuously, and the requirements for storing and searching
would also grow permanently. A limitation of the dictionary size and a corresponding update
mechanism are therefore required.

LZ78 is the base for other compression methods like the wide-spread LZW used e.g. for GIF
graphics.

Lempel-Ziv 1978
One large problem with the LZ77 method is that it does not use the coding space
efficiently, i.e. there are length and offset values that never get used. If the history
buffer contains multiple copies of a string, only the latest occurrence is needed, but
they all take space in the offset value space. Each duplicate string wastes one offset
value.

To get higher efficiency, we have to create a real dictionary. Strings are added to the codebook
only once. There are no duplicates that waste bits just because they exist. Also, each entry in the
codebook will have a specific length, thus only an index to the codebook is needed to specify a
string (phrase). In LZ77 the length and offset values were handled more or less as disconnected
variables although there is correlation. Because they are now handled as one entity, we can
expect to do a little better in that regard also.
LZ78-type compressors use this kind of a dictionary. The next part of the message (the
lookahead buffer contents) is searched from the dictionary and the maximum-length match is
returned. The output code is an index to the dictionary. If there is no suitable entry in the
dictionary, the next input symbol is sent as a literal symbol. The dictionary is updated after each
symbol is encoded, so that it is possible to build an identical dictionary in the decompression
code without sending additional data.
Essentially, strings that we have seen in the data are added to the dictionary. To be able to
constantly adapt to the message statistics, the dictionary must be trimmed down by discarding
the oldest entries. This also prevents the dictionary from becoming full, which would decrease
the compression ratio. This is handled automatically in LZ77 by its use of a history buffer (a
sliding window); for LZ78 it must be implemented separately. Because the decompression code
updates its dictionary in synchronization with the compressor, the code remains uniquely
decodable.

LZ78
Example: "bed spreaders spread spreads on beds"

Encoding
At the beginning of encoding the dictionary is empty. In order to explain the principle of
encoding, let's consider a point within the encoding process, when the dictionary already
contains some strings.
We start analyzing a new prefix in the charstream, beginning with an empty prefix. If its
corresponding string (prefix + the character after it -- P+C) is present in the dictionary, the prefix
is extended with the character C. This extending is repeated until we get a string which is not
present in the dictionary. At that point we output two things to the codestream: the code word
that represents the prefix P, and then the character C. Then we add the whole string (P+C) to the
dictionary and start processing the next prefix in the charstream.
A special case occurs if the dictionary doesn't contain even the starting one-character string (for
example, this always happens in the first encoding step). In that case we output a special code
word that represents an empty string, followed by this character and add this character to the
dictionary.
The output from this algorithm is a sequence of code word-character pairs (W,C). Each time a
pair is output to the codestream, the string from the dictionary corresponding to W is extended
with the character C and the resulting string is added to the dictionary. This means that when a
new string is being added, the dictionary already contains all the substrings formed by removing
characters from the end of the new string.
The encoding algorithm

1. At the start, the dictionary and P are empty;
2. C := next character in the charstream;
3. Is the string P+C present in the dictionary?
   a. if it is, P := P+C (extend P with C);
   b. if not:
      i.   output these two objects to the codestream:
           - the code word corresponding to P (if P is empty, output a zero);
           - C, in the same form as input from the charstream;
      ii.  add the string P+C to the dictionary;
      iii. P := empty;
   c. are there more characters in the charstream?
      - if yes, return to step 2;
      - if not:
        i.  if P is not empty, output the code word corresponding to P;
        ii. END.

Decoding
At the start of decoding the dictionary is empty. It gets reconstructed in the process of
decoding. In each step a code word-character pair (W,C) is read from the codestream.
The code word always refers to a string already present in the dictionary. The string.W
and C are output to the charstream, and the string string.W+C is added to the dictionary.
After the decoding, the dictionary will look exactly the same as after the encoding.

The decoding algorithm
1. At the start the dictionary is empty;
2. W := next code word in the codestream;
3. C := the character following it;
4. output the string.W to the charstream (this can be an empty string), and then
   output C;
5. add the string string.W+C to the dictionary;
6. are there more code words in the codestream?
   - if yes, go back to step 2;
   - if not, END.
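The algorithm above translates almost directly into code. The sketch below is an illustrative Python version in which the dictionary maps strings to their index (index 0 standing for the empty string) and code words are emitted as (index, character) pairs; the function names are my own.

def lz78_encode(text):
    """Encode text as a list of (dictionary index, character) pairs."""
    dictionary = {}          # string -> index
    result, p = [], ""
    for c in text:
        if p + c in dictionary:
            p += c                                   # keep extending the prefix
        else:
            result.append((dictionary.get(p, 0), c)) # output (code word for P, C)
            dictionary[p + c] = len(dictionary) + 1  # add P+C to the dictionary
            p = ""
    if p:                                            # flush a trailing prefix, if any
        result.append((dictionary[p], ""))
    return result

def lz78_decode(pairs):
    dictionary = {0: ""}     # index -> string
    out = []
    for w, c in pairs:
        s = dictionary[w] + c
        out.append(s)
        dictionary[len(dictionary)] = s
    return "".join(out)

pairs = lz78_encode("ABBCBCABA")
print(pairs)                              # [(0,'A'), (0,'B'), (2,'C'), (3,'A'), (2,'A')]
print(lz78_decode(pairs) == "ABBCBCABA")  # True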

An example
The encoding process is presented in Table 1.
o The column Step indicates the number of the encoding step. Each encoding step
is completed when the step 3.b. in the encoding algorithm is executed.
o The column Pos indicates the current position in the input data.
o The column Dictionary shows what string has been added to the dictionary. The
index of the string is equal to the step number.
o The column Output presents the output in the form (W,C).
o The output of each step decodes to the string that has been added to the dictionary.
Charstream to be encoded:

Pos:   1  2  3  4  5  6  7  8  9
Char:  A  B  B  C  B  C  A  B  A

Table 1: The encoding process

Step   Pos   Dictionary   Output
1.     1     1. A         (0,A)
2.     2     2. B         (0,B)
3.     3     3. BC        (2,C)
4.     5     4. BCA       (3,A)
5.     8     5. BA        (2,A)

Lempel-Ziv-Welch (LZW)
The LZW compression method is derived from LZ78 as introduced by Jacob Ziv and Abraham
Lempel. It was devised by Terry A. Welch in 1984, who published his considerations in the
article "A Technique for High-Performance Data Compression".

At that time Terry A. Welch was employed in a leading position at the Sperry Research Center.
The LZW method is covered by patents valid in a number of countries, e.g. the USA, Europe and
Japan. Meanwhile Unisys holds the rights, but there are probably more patents from other
companies regarding LZW as well. Some of these patents expire in 2003 (USA) and 2004 (Europe,
Japan).

LZW is an important part of a variety of data formats. Graphics formats like GIF, TIFF (optionally)
and PostScript (optionally) use LZW for entropy coding.

Fundamental algorithm:

LZW builds a dictionary that contains every byte sequence already coded. The compressed
data consist exclusively of indices into this dictionary. Before starting, the dictionary is preset
with entries for the 256 single-byte symbols. Every entry added afterwards represents a sequence
longer than one byte.

The algorithm presented by Terry Welch defines mechanisms to create the dictionary and to
ensure that it will be identical for both the encoding and decoding process.

Arithmetic Coding
Arithmetic coding is the most efficient method to code symbols according to the probability of
their occurrence. The average code length corresponds exactly to the possible minimum given by
information theory. Deviations caused by the bit resolution of binary code trees do not exist.

In contrast to a binary Huffman code tree, arithmetic coding offers a clearly better
compression rate. Its implementation is more complex, on the other hand.

Unfortunately its usage is restricted by patents. As far as is known, it is not allowed to use
arithmetic coding without acquiring licences.

Arithmetic coding is part of the JPEG data format. As an alternative to Huffman coding, it can be
used for the final entropy coding. Despite its lower efficiency, Huffman coding remains the standard
due to the legal restrictions mentioned above.

LZW Compression for a String

LZW Encoding Algorithm

Initialize the table with single character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output code for P
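An illustrative Python version of this encoder, with the table preset to the 256 single-byte symbols as described in the Fundamentals above, might look like this (it is a sketch, not a reference implementation of any particular file format):

def lzw_encode(data: bytes):
    """Return a list of dictionary indices encoding `data`."""
    table = {bytes([i]): i for i in range(256)}   # preset: all single-byte strings
    result, p = [], b""
    for byte in data:
        c = bytes([byte])
        if p + c in table:
            p += c                                 # extend the current string P
        else:
            result.append(table[p])                # output the code for P
            table[p + c] = len(table)              # add P + C to the string table
            p = c
    if p:
        result.append(table[p])                    # output code for the final P
    return result

print(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT"))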

Arithmetic Coding
In arithmetic coding, a message is encoded as a real number in the interval from zero to one.
Arithmetic coding typically has a better compression ratio than Huffman coding, as it produces a
single codeword for the whole message rather than several separate codewords. Arithmetic coding
is a lossless coding technique. There are a few disadvantages of arithmetic coding. One is that the
whole codeword must be received before decoding of the symbols can start, and if there is a corrupt
bit in the codeword, the entire message could become corrupt. Another is that there is a limit to the
precision of the number which can be encoded, thus limiting the number of symbols that can be
encoded within a codeword. There also exist many patents on arithmetic coding, so the use of some
of the algorithms incurs royalty fees.
Here is the arithmetic coding algorithm, with an example to aid understanding.

1. Start with the interval [0, 1), divided into subintervals for all possible symbols appearing
in the message. Make the size of each subinterval proportional to the frequency at
which the symbol appears in the message. E.g.:

Symbol   Probability   Interval
a        0.2           [0.0, 0.2)
b        0.3           [0.2, 0.5)
c        0.1           [0.5, 0.6)
d        0.4           [0.6, 1.0)

2. When encoding a symbol, "zoom" into the current interval and divide it into subintervals
as in step one, but over the new range. Example: suppose we want to encode "abd". We
"zoom" into the interval corresponding to "a", [0.0, 0.2), and divide that interval into smaller
subintervals as before. We now use this new interval as the basis of the next symbol
encoding step.

Symbol   Subinterval of the "a" interval
a        [0.0, 0.04)
b        [0.04, 0.10)
c        [0.10, 0.12)
d        [0.12, 0.20)

3. Repeat the process until the maximum precision of the machine is reached, or all symbols
are encoded. To encode the next character "b", we use the "a" interval created before,
zoom into its subinterval for "b", [0.04, 0.10), and use that for the next step. This produces:

Symbol   Subinterval of the "b" interval
a        [0.04, 0.052)
b        [0.052, 0.07)
c        [0.07, 0.076)
d        [0.076, 0.10)

And lastly, encoding the final character "d" selects the subinterval [0.076, 0.10), whose
subdivision (which would be used for any further symbol) is:

Symbol   Subinterval of the "d" interval
a        [0.076, 0.0808)
b        [0.0808, 0.088)
c        [0.088, 0.0904)
d        [0.0904, 0.10)

4. Transmit some number within the latest interval as the codeword. The number of
symbols encoded will be stated in the protocol of the file format, so any number within
[0.076, 0.10) will be acceptable.
To decode the message, a similar algorithm is followed, except that the final number is given,
and the symbols are decoded sequentially from that.
Algorithm Overview

Arithmetic coding is similar to Huffman coding; they both achieve their compression by
reducing the average number of bits required to represent a symbol.
Given:
An alphabet with symbols S0, S1, ..., Sn, where each symbol has a probability of occurrence p0,
p1, ..., pn such that the pi sum to 1.
From the fundamental theorem of information theory, it can be shown that the optimal coding for
Si requires -log2(pi) bits.
More often than not, the optimal number of bits is fractional. Unlike Huffman coding, arithmetic
coding provides the ability to represent symbols with fractional bits.
Since the pi sum to 1, we can represent each probability pi as a unique non-overlapping range of
values between 0 and 1. There's no magic in this; we're just creating ranges on a probability line.
For example, suppose we have an alphabet 'a', 'b', 'c', 'd', and 'e' with probabilities of occurrence
of 30%, 15%, 25%, 10%, and 20%. We can choose the following range assignments for each
symbol based on its probability:
TABLE 1. Sample Symbol Ranges

Symbol   Probability   Range
a        30%           [0.00, 0.30)
b        15%           [0.30, 0.45)
c        25%           [0.45, 0.70)
d        10%           [0.70, 0.80)
e        20%           [0.80, 1.00)

Square brackets '[' and ']' mean the adjacent number is included, and parentheses '(' and ')'
mean the adjacent number is excluded.
Range assignments like the ones in this table can then be used for encoding and decoding strings
of symbols in the alphabet. Algorithms using ranges for coding are often referred to as range
coders.
Encoding Strings

By assigning each symbol its own unique probability range, it's possible to encode a single
symbol by its range. Using this approach, we could encode a string as a series of probability
ranges, but that doesn't compress anything. Instead, additional symbols are encoded by
restricting the current probability range by the range of each new symbol being encoded. The pseudo
code below illustrates how additional symbols may be added to an encoded string by restricting
the string's range bounds.

lower bound = 0
upper bound = 1
while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range * upper bound of new symbol)
    lower bound = lower bound + (current range * lower bound of new symbol)
end while

Any value between the computed lower and upper probability bounds now encodes the input
string.
Example:
Encode the string "ace" using the probability ranges from Table 1.
Start with lower and upper probability bounds of 0 and 1.

Encode 'a'
current range = 1 - 0 = 1
upper bound = 0 + (1 * 0.30) = 0.30
lower bound = 0 + (1 * 0.00) = 0.00

Encode 'c'
current range = 0.30 - 0.00 = 0.30
upper bound = 0.00 + (0.30 * 0.70) = 0.210
lower bound = 0.00 + (0.30 * 0.45) = 0.135

Encode 'e'
current range = 0.210 - 0.135 = 0.075
upper bound = 0.135 + (0.075 * 1.00) = 0.210
lower bound = 0.135 + (0.075 * 0.80) = 0.195

The string "ace" may be encoded by any value within the probability range [0.195, 0.210).
It should become apparent from the example that precision requirements increase as additional
symbols are encoded. Strings of unlimited length require infinite precision probability range
bounds. The section on implementation discusses how the need for infinite precision is handled.
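A direct transcription of this pseudocode into Python might look like the following (using the ranges of Table 1; floating point arithmetic is fine for a short demo, though a practical coder works with scaled integers for exactly the precision reasons just mentioned):

# Probability ranges from Table 1: symbol -> (lower bound, upper bound)
RANGES = {"a": (0.00, 0.30), "b": (0.30, 0.45), "c": (0.45, 0.70),
          "d": (0.70, 0.80), "e": (0.80, 1.00)}

def arith_encode(message):
    """Return the (lower, upper) interval that encodes the message."""
    lower, upper = 0.0, 1.0
    for sym in message:
        cur_range = upper - lower
        sym_low, sym_high = RANGES[sym]
        upper = lower + cur_range * sym_high
        lower = lower + cur_range * sym_low
    return lower, upper

print(arith_encode("ace"))   # roughly (0.195, 0.21)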
Decoding Strings

The decoding process must start with an encoded value representing a string. By definition, the
encoded value lies within the lower and upper probability range bounds of the string it
represents. Since the encoding process keeps restricting ranges (without shifting), the initial
value also falls within the range of the first encoded symbol. Successive encoded symbols may
be identified by removing the scaling applied by the known symbol. To do this, subtract the
lower probability range bound of the known symbol, and divide by the size of the symbol's
range.
Based on the discussion above, decoding a value may be performed following the steps in the
pseudo code below:

encoded value = encoded input
while string is not fully decoded
    identify the symbol containing the encoded value within its range
    // remove effects of the symbol from the encoded value
    current range = upper bound of the symbol - lower bound of the symbol
    encoded value = (encoded value - lower bound of the symbol) / current range
end while

Example:
Using the probability ranges from Table 1, decode the three character string encoded as 0.20.

Decode first symbol
0.20 is within [0.00, 0.30)
0.20 encodes 'a'

Remove effects of 'a' from the encoded value
current range = 0.30 - 0.00 = 0.30
encoded value = (0.20 - 0.00) / 0.30 = 0.67 (rounded)

Decode second symbol
0.67 is within [0.45, 0.70)
0.67 encodes 'c'

Remove effects of 'c' from the encoded value
current range = 0.70 - 0.45 = 0.25
encoded value = (0.67 - 0.45) / 0.25 = 0.88

Decode third symbol
0.88 is within [0.80, 1.00)
0.88 encodes 'e'

The decoded string is "ace".
In case you were sleeping, this is the string that was encoded in the encoding example.
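The inverse procedure can be sketched just as briefly; this illustrative decoder reuses the RANGES table from the encoder sketch above and assumes the number of symbols is known (three here), since, as noted earlier, that count would be stated by the file format.

def arith_decode(value, n_symbols):
    """Decode n_symbols from an encoded value, using the ranges of Table 1."""
    out = []
    for _ in range(n_symbols):
        for sym, (low, high) in RANGES.items():
            if low <= value < high:                    # identify the symbol containing the value
                out.append(sym)
                value = (value - low) / (high - low)   # remove the symbol's scaling
                break
    return "".join(out)

print(arith_decode(0.20, 3))   # ace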
