Algorithms for data compression usually proceed as follows. They encode a text over some finite alphabet into a sequence of bits, thereby exploiting the fact that the letters of this alphabet occur with different frequencies. For instance, an “e” occurs more frequently than a “q” and will therefore be assigned a shorter codeword. The quality of the compression procedure is then measured in terms of the average codeword length.

So the underlying model is probabilistic: we consider a finite alphabet and a probability distribution on this alphabet, where the probability distribution reflects the (relative) frequencies of the letters. Such a pair—an alphabet with a probability distribution—is called a source. We shall first introduce some basic facts from Information Theory. Most important is the notion of entropy, since the source entropy characterises the achievable lower bounds for compressibility.

The best understood source model is the discrete memoryless source. Here the letters occur independently of each other in the text. The use of prefix codes, in which no codeword is the beginning of another one, makes it possible to compress the text down to the entropy of the source. We shall study this in detail. The lower bound is obtained via Kraft's inequality; the achievability is demonstrated by the use of Huffman codes, which can be shown to be optimal.

The discrete memoryless source relies on assumptions that are not fulfilled in most practical situations. Firstly, the source model is usually not realistic, since the letters do not occur independently in the text. Secondly, the probability distribution is not known in advance. So the coding algorithms should be universal for a whole class of probability distributions on the alphabet. The analysis of such universal coding techniques is much more involved than the analysis of the discrete memoryless source, so we shall only present the algorithms and do not prove the quality of their performance. Universal coding techniques mainly fall into two classes.

Statistical coding techniques estimate the probability of the next letters as accurately as possible. This process is called modelling of the source. Once enough information about the probabilities is available, the text is encoded, usually by arithmetic coding. Here the probability is represented by an interval, and this interval will be encoded.

Dictionary-based algorithms store patterns that occurred before in the text in a dictionary; at the next occurrence, a pattern is encoded via its position in the dictionary. The most prominent procedure of this kind is due to Ziv and Lempel.

We shall also present a third universal coding technique which falls in neither of these two classes. The algorithm due to Burrows and Wheeler has become quite prominent in recent years, since implementations based on it perform very well in practice.

All algorithms mentioned so far are lossless, i. e., no information is lost after decoding, so the original text will be recovered without any errors. In contrast, there are lossy data compression techniques, where the text obtained after decoding does not completely coincide with the original text. Lossy compression algorithms are used in applications like image, sound, video, or speech compression. The loss should, of course, only marginally affect the quality. For instance, frequencies not perceivable by the human eye or ear can be dropped. However, the understanding of such techniques requires a solid background in image, sound or speech processing, which would be far beyond the scope of this paper, so we shall illustrate only the basic concepts behind image compression algorithms such as JPEG.

We emphasise here the recent developments such as the Burrows-Wheeler transform and the context-tree weighting method. Rigorous proofs will only be presented for the results on the discrete memoryless source, which is best understood but not a very realistic source model in practice. However, it is also the basis for more complicated source models, where the calculations involve conditional probabilities. The asymptotic computational complexity of compression algorithms is often linear in the text length, since the algorithms simply parse through the text. However, the running time relevant for practical implementations is mostly determined by the constants, such as the dictionary size in Ziv-Lempel coding or the depth of the context tree when arithmetic coding is applied. Further, an exact analysis or comparison of compression algorithms often heavily depends on the structure of the source or the type of file to be compressed, so the performance of compression algorithms is usually tested on benchmark files. The most well-known collections of benchmark files are the Calgary Corpus and the Canterbury Corpus.

The source model discussed throughout this chapter is the **Discrete Memoryless Source** (DMS). Such a source is a pair $(\mathcal{X}, P)$, where $\mathcal{X} = \{1, \dots, m\}$ is a finite alphabet and $P = \big(P(1), \dots, P(m)\big)$ is a probability distribution on $\mathcal{X}$. A discrete memoryless source can also be described by a random variable $X$, where $\Pr(X = x) = P(x)$ for all $x \in \mathcal{X}$. A word $x^n = (x_1, \dots, x_n) \in \mathcal{X}^n$ is the realization of the random variables $(X_1, \dots, X_n)$, where the $X_i$ are identically distributed and independent of each other. So the probability $P^n(x_1, \dots, x_n) = P(x_1) \cdot P(x_2) \cdots P(x_n)$ is the product of the probabilities of the single letters.
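The product formula for word probabilities translates directly into code; the two-letter alphabet and its probabilities below are hypothetical, chosen only for illustration:

```python
from functools import reduce

# Hypothetical two-letter discrete memoryless source.
P = {"a": 0.9, "b": 0.1}

def word_probability(word, P):
    """Probability of a word under a DMS: the product of the
    probabilities of its single letters (1.0 for the empty word)."""
    return reduce(lambda acc, x: acc * P[x], word, 1.0)

print(word_probability("aab", P))  # 0.9 * 0.9 * 0.1, approximately 0.081
```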

Estimations for the letter probabilities in natural languages are obtained by statistical methods. If we consider the English language and choose for $\mathcal{X}$ the Latin alphabet with an additional symbol for space and punctuation marks, the probability distribution can be derived from the frequency table in 3.1, which is obtained from the copy-fitting tables used by professional printers. The probability of each letter can then be read off directly from this table.

Observe that this source model is often not realistic. For instance, in English texts the combination `th' occurs more often than `ht'. This could not be the case if an English text was produced by a discrete memoryless source, since then $P(t) \cdot P(h) = P(h) \cdot P(t)$, i. e., the words `th' and `ht' would have the same probability.

In the discussion of the communication model it was pointed out that the encoder wants to compress the original data into a short sequence of binary digits, thereby using a binary code, i. e., a function $c: \mathcal{X} \longrightarrow \{0,1\}^* = \bigcup_{n \ge 1} \{0,1\}^n$. To each element $x \in \mathcal{X}$ a codeword $c(x)$ is assigned. The aim of the encoder is to minimise the average length of the codewords. It turns out that the best possible data compression can be described in terms of the **entropy** $H(P)$ of the probability distribution $P$. The entropy is given by the formula

$$H(P) = -\sum_{x \in \mathcal{X}} P(x) \cdot \log P(x) ,$$

where the logarithm is to the base 2. We shall also use the notation $H(X)$ according to the interpretation of the source as a random variable $X$.
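The entropy formula can be evaluated in a few lines; the distributions below are hypothetical examples:

```python
import math

def entropy(P):
    """H(P) = -sum_x P(x) * log2 P(x); terms with P(x) == 0 contribute 0."""
    return -sum(p * math.log2(p) for p in P if p > 0)

# A fair coin carries exactly one bit per letter:
print(entropy([0.5, 0.5]))   # 1.0
# A biased source has lower entropy, hence is more compressible:
print(entropy([0.9, 0.1]))   # ~0.469
```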

A **code** (of variable length) is a function $c: \mathcal{X} \longrightarrow \{0,1\}^*$, $\mathcal{X} = \{1, \dots, m\}$. Here $\{c(1), c(2), \dots, c(m)\}$ is the set of **codewords**, where for $x \in \{1, \dots, m\}$ the codeword is $c(x) = \big(c_1(x), c_2(x), \dots, c_{L(x)}(x)\big)$, where $L(x)$ denotes the **length** of $c(x)$, i. e., the number of bits used to represent $c(x)$.

A code $c$ is **uniquely decipherable** (UDC), if every word in $\{0,1\}^*$ is representable by at most one sequence of codewords.

A code $c$ is a **prefix code**, if no codeword is prefix of another one, i. e., for any two codewords $c(x)$ and $c(y)$, $x \neq y$, with $L(x) \le L(y)$ holds $\big(c_1(x), \dots, c_{L(x)}(x)\big) \neq \big(c_1(y), \dots, c_{L(x)}(y)\big)$. So $c(x)$ and $c(y)$ differ in at least one of the first $L(x)$ components.

Messages encoded using a prefix code are uniquely decipherable. The decoder proceeds by reading bits until a codeword $c(x)$ is formed. Since $c(x)$ cannot be the beginning of another codeword, it must correspond to the letter $x$. Now the decoder continues until another codeword is formed. The process may be repeated until the end of the message. So after having found the codeword $c(x)$ the decoder instantaneously knows that $x$ is the next letter of the message. Because of this property a prefix code is also called an instantaneous code.
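This instantaneous decoding procedure can be sketched as follows; the three-letter prefix code in the example is hypothetical:

```python
def decode_prefix(bits, codebook):
    """Instantaneous decoding: read bits until the buffer equals a codeword,
    emit the corresponding letter, and start over.  Assumes `codebook`
    (letter -> codeword) describes a prefix code."""
    inverse = {cw: letter for letter, cw in codebook.items()}
    message, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:                 # no codeword is a prefix of
            message.append(inverse[buffer])   # another, so this match is final
            buffer = ""
    if buffer:
        raise ValueError("bit sequence does not end on a codeword boundary")
    return "".join(message)

code = {"a": "0", "b": "10", "c": "11"}   # a hypothetical prefix code
print(decode_prefix("010110", code))      # prints "abca"
```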

The criterion for data compression is to minimise the average length of the codewords. So if we are given a source $(\mathcal{X}, P)$, where $\mathcal{X} = \{1, \dots, m\}$ and $P$ is a probability distribution on $\mathcal{X}$, the **average length** $L(c)$ is defined by

$$L(c) = \sum_{x \in \mathcal{X}} P(x) \cdot L(x) .$$

A concrete prefix code for English texts and its average length can be derived from the frequency table mentioned above.

We can still do better if we do not encode single letters, but blocks of $n$ letters for some $n \in \mathbb{N}$. In this case we replace the source $(\mathcal{X}, P)$ by $(\mathcal{X}^n, P^n)$. Remember that $P^n(x_1, \dots, x_n) = P(x_1) \cdots P(x_n)$ for a word $(x_1, \dots, x_n) \in \mathcal{X}^n$, since the source is memoryless. If e.g. we are given an alphabet with two letters, $\mathcal{X} = \{a, b\}$, and $P(a) = 0.9$, $P(b) = 0.1$, then the code $c$ defined by $c(a) = 0$, $c(b) = 1$ has average length $L(c) = 1$. Obviously we cannot find a better code encoding single letters. The combinations of two letters now have the following probabilities:

$$P^2(aa) = 0.81, \quad P^2(ab) = 0.09, \quad P^2(ba) = 0.09, \quad P^2(bb) = 0.01 .$$

The prefix code $c^2$ defined by

$$c^2(aa) = 0, \quad c^2(ab) = 10, \quad c^2(ba) = 110, \quad c^2(bb) = 111$$

has average length $L(c^2) = 1 \cdot 0.81 + 2 \cdot 0.09 + 3 \cdot 0.09 + 3 \cdot 0.01 = 1.29$. So $\frac{1}{2} L(c^2) = 0.645$ could be interpreted as the average length the code $c^2$ requires per letter of the alphabet $\mathcal{X}$. When we encode blocks of $n$ letters we are interested in the behaviour of

$$L(n, P) = \min_{c\ \mathrm{UDC}}\ \frac{1}{n} \sum_{(x_1, \dots, x_n) \in \mathcal{X}^n} P^n(x_1, \dots, x_n) \cdot L(x_1, \dots, x_n) .$$

It follows from the Noiseless Coding Theorem, which is stated in the next section, that $\lim_{n \to \infty} L(n, P) = H(P)$, the entropy of the source $(\mathcal{X}, P)$.

In our example for the English language the entropy $H(P)$ is only slightly smaller than the average length of the code presented above. So the code, where only single letters are encoded, is already nearly optimal with respect to $L(1, P)$. Further compression is possible if we consider the dependencies between the letters.
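The gain from block encoding can be checked numerically. The two-letter distribution and the pair code below are hypothetical examples; the entropy computed first is the per-letter limit that no block length can beat:

```python
import math

# Hypothetical binary source: P(a) = 0.9, P(b) = 0.1 (illustration only).
P = {"a": 0.9, "b": 0.1}
H = -sum(p * math.log2(p) for p in P.values())   # ~0.469 bits per letter

# Encoding single letters cannot beat 1 bit per letter.
# Encoding pairs with the prefix code below already does better:
pair_code = {"aa": "0", "ab": "10", "ba": "110", "bb": "111"}
avg_pair = sum(P[x] * P[y] * len(pair_code[x + y]) for x in P for y in P)
print(avg_pair / 2)   # ~0.645 bits per letter, between H and 1
```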

We shall now introduce a necessary and sufficient condition for the existence of a prefix code with prescribed word lengths $L(1), \dots, L(m)$.

**Theorem 3.1 **
**(Kraft's inequality)**
*Let $\mathcal{X} = \{1, \dots, m\}$. A uniquely decipherable code $c: \mathcal{X} \longrightarrow \{0,1\}^*$ with word lengths $L(1), \dots, L(m)$ exists, if and only if*

$$\sum_{x = 1}^{m} 2^{-L(x)} \le 1 .$$

**Proof. **The central idea is to interpret the codewords as nodes of a rooted binary tree with depth $T = \max_{x \in \mathcal{X}} L(x)$. The tree is required to be complete (every path from the root to a leaf has length $T$) and regular (every inner node has outdegree 2). The example in Figure 3.2 may serve as an illustration.

So the nodes with distance $t$ from the root are labeled with the words $x = (x_1, \dots, x_t) \in \{0,1\}^t$. The upper successor of $(x_1, \dots, x_t)$ is labeled $(x_1, \dots, x_t, 0)$, its lower successor is labeled $(x_1, \dots, x_t, 1)$.

The **shadow** of a node labeled by $(x_1, \dots, x_t)$ is the set of all the leaves which are labeled by a word (of length $T$) beginning with $(x_1, \dots, x_t)$. In other words, the shadow of $(x_1, \dots, x_t)$ consists of the leaves labeled by a sequence with prefix $(x_1, \dots, x_t)$; such a shadow contains exactly $2^{T-t}$ leaves.

Now suppose we are given positive integers $L(1), \dots, L(m)$ with $\sum_{x=1}^{m} 2^{-L(x)} \le 1$. We further assume that $L(1) \le L(2) \le \dots \le L(m)$. As first codeword the all-zero word $c(1) = (0, \dots, 0)$ of length $L(1)$ is chosen. Since $m \ge 2$ (otherwise only one letter has to be encoded), we have $2^{-L(1)} < 1$. Hence there are some nodes left on the $T$-level which are not in the shadow of $c(1)$. We pick the first of these remaining nodes and go back $T - L(2)$ steps in the direction of the root. Since $L(2) \ge L(1)$, we shall find a node labeled by a sequence of $L(2)$ bits which is not a prefix of $c(1)$. So we can choose this sequence as $c(2)$. Now again, either $m = 2$, and we are ready, or by the hypothesis $2^{-L(1)} + 2^{-L(2)} < 1$ and we can find a node on the $T$-level which is not contained in the shadows of $c(1)$ and $c(2)$. We find the next codeword as shown above. The process can be continued until all codewords are assigned.
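The construction from the proof can be sketched in code: processing the lengths in nondecreasing order and always taking the first free node on the appropriate level amounts to incrementing a binary counter (a so-called canonical code). The lengths in the usage example are hypothetical:

```python
def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality."""
    return sum(2.0 ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Assign codewords for given lengths satisfying Kraft's inequality,
    mirroring the tree construction in the proof."""
    assert kraft_sum(lengths) <= 1.0, "Kraft's inequality violated"
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)              # descend to level l
        codewords.append(format(value, f"0{l}b"))
        value += 1                            # skip the shadow of this node
        prev_len = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```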

Conversely, observe that for any $s \in \mathbb{N}$

$$\Big( \sum_{x \in \mathcal{X}} 2^{-L(x)} \Big)^s = \Big( \sum_{j=1}^{T} w_j\, 2^{-j} \Big)^s ,$$

where $w_j$ is the number of codewords with length $j$ in the uniquely decipherable code and $T$ again denotes the maximal word length.

The power of this term can be expanded as

$$\Big( \sum_{j=1}^{T} w_j\, 2^{-j} \Big)^s = \sum_{k=s}^{T \cdot s} N_k\, 2^{-k} .$$

Here $N_k = \sum_{i_1 + \dots + i_s = k} w_{i_1} \cdots w_{i_s}$ is the total number of messages whose coded representation is of length $k$.

Since the code is uniquely decipherable, to every sequence of $k$ bits corresponds at most one possible message. Hence $N_k \le 2^k$ and

$$\Big( \sum_{x \in \mathcal{X}} 2^{-L(x)} \Big)^s = \sum_{k=s}^{T \cdot s} N_k\, 2^{-k} \le \sum_{k=s}^{T \cdot s} 2^k\, 2^{-k} = T \cdot s - s + 1 \le T \cdot s .$$

Taking $s$-th root this yields $\sum_{x \in \mathcal{X}} 2^{-L(x)} \le (T \cdot s)^{\frac{1}{s}}$.

Since this inequality holds for any $s$ and $\lim_{s \to \infty} (T \cdot s)^{\frac{1}{s}} = 1$, we have the desired result

$$\sum_{x \in \mathcal{X}} 2^{-L(x)} \le 1 .$$

**Theorem 3.2 **
**(Noiseless Coding Theorem)**
*For a source $(\mathcal{X}, P)$, $\mathcal{X} = \{1, \dots, m\}$, it is always possible to find a uniquely decipherable code $c: \mathcal{X} \longrightarrow \{0,1\}^*$ with average length*

$$H(P) \le L_{\min}(P) < H(P) + 1 ,$$

*where $L_{\min}(P)$ denotes the minimal average codeword length over all uniquely decipherable codes for $P$.*

**Proof. **Let $L(1), \dots, L(m)$ denote the codeword lengths of an optimal uniquely decipherable code. Now we define a probability distribution $Q$ by $Q(x) = \frac{2^{-L(x)}}{r}$ for $x = 1, \dots, m$, where $r = \sum_{x=1}^{m} 2^{-L(x)}$. By Kraft's inequality $r \le 1$.

For two probability distributions $P$ and $Q$ on $\mathcal{X}$ the **I-divergence** $D(P\|Q)$ is defined by

$$D(P\|Q) = \sum_{x \in \mathcal{X}} P(x) \cdot \log \frac{P(x)}{Q(x)} .$$

I-divergence is a good measure for the distance of two probability distributions. In particular, the I-divergence $D(P\|Q) \ge 0$ always holds. So for the probability distribution $Q$ defined above

$$0 \le D(P\|Q) = -H(P) - \sum_{x \in \mathcal{X}} P(x) \cdot \log \big( 2^{-L(x)} \cdot r^{-1} \big) .$$

From this it follows that

$$L_{\min}(P) = \sum_{x \in \mathcal{X}} P(x) \cdot L(x) \ge H(P) - \log r .$$

Since $r \le 1$, $\log r \le 0$ and hence $L_{\min}(P) \ge H(P)$.

In order to prove the right-hand side of the Noiseless Coding Theorem we define $L(x) = \lceil -\log P(x) \rceil$ for all $x \in \mathcal{X}$. Observe that $-\log P(x) \le L(x) < -\log P(x) + 1$ and hence $P(x) \ge 2^{-L(x)}$.

So $1 = \sum_{x \in \mathcal{X}} P(x) \ge \sum_{x \in \mathcal{X}} 2^{-L(x)}$ and from Kraft's Inequality we know that there exists a uniquely decipherable code with word lengths $\lceil -\log P(1) \rceil, \dots, \lceil -\log P(m) \rceil$. This code has average length

$$\sum_{x \in \mathcal{X}} P(x) \cdot \lceil -\log P(x) \rceil < \sum_{x \in \mathcal{X}} P(x) \big( -\log P(x) + 1 \big) = H(P) + 1 .$$
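The choice $L(x) = \lceil -\log P(x) \rceil$ from the proof can be checked numerically; the distribution below is a hypothetical example:

```python
import math

def shannon_lengths(P):
    """Codeword lengths L(x) = ceil(-log2 P(x)) used in the proof."""
    return [math.ceil(-math.log2(p)) for p in P]

P = [0.4, 0.3, 0.2, 0.1]                        # hypothetical distribution
L = shannon_lengths(P)                           # [2, 2, 3, 4]
H = -sum(p * math.log2(p) for p in P)
avg = sum(p * l for p, l in zip(P, L))

assert sum(2.0 ** -l for l in L) <= 1.0          # Kraft: such a code exists
assert H <= avg < H + 1                          # Noiseless Coding Theorem bound
print(H, avg)
```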

In the proof of the Noiseless Coding Theorem it was explicitly shown how to construct a prefix code $c$ for a given probability distribution $P$. The idea was to assign to each $x \in \mathcal{X}$ a codeword of length $L(x) = \lceil -\log P(x) \rceil$ by choosing an appropriate vertex in the tree introduced. However, this procedure does not always yield an optimal code. If e.g. we are given the uniform distribution on three letters, each letter is assigned a codeword of length $\lceil \log 3 \rceil = 2$, which yields average codeword length $2$, whereas the prefix code with codeword lengths $1$, $2$, $2$ (e.g. $0$, $10$, $11$) has average length only $\frac{5}{3}$.

Shannon gave an explicit procedure for obtaining codes with codeword lengths $\lceil -\log P(x) \rceil$ using the binary representation of cumulative probabilities (Shannon remarked this procedure was originally due to Fano). The elements of the source are ordered according to nonincreasing probabilities $P(1) \ge P(2) \ge \dots \ge P(m)$. Then the codeword $c(x)$ consists of the first $\lceil -\log P(x) \rceil$ bits of the binary expansion of the sum $F(x) = \sum_{j < x} P(j)$. This procedure was further developed by Elias. The elements of the source now may occur in any order. The **Shannon-Fano-Elias-code** has as codewords the first $\lceil -\log P(x) \rceil + 1$ bits of the binary expansion of the sum $\overline{F}(x) = \sum_{j < x} P(j) + \frac{P(x)}{2}$.
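A direct transcription of the Shannon-Fano-Elias construction may look as follows; the distribution in the example is hypothetical, with dyadic probabilities so that the binary expansions terminate:

```python
import math

def sfe_code(P):
    """Shannon-Fano-Elias code: the codeword for x consists of the first
    ceil(-log2 P(x)) + 1 bits of the binary expansion of
    Fbar(x) = sum_{j<x} P(j) + P(x)/2.  The order of P is arbitrary."""
    codewords, F = [], 0.0
    for p in P:
        Fbar = F + p / 2
        length = math.ceil(-math.log2(p)) + 1
        bits, frac = "", Fbar
        for _ in range(length):          # first `length` bits of Fbar
            frac *= 2
            bit, frac = divmod(frac, 1)
            bits += str(int(bit))
        codewords.append(bits)
        F += p
    return codewords

print(sfe_code([0.25, 0.5, 0.125, 0.125]))  # ['001', '10', '1101', '1111']
```

Note that the resulting codewords form a prefix code, at the price of one extra bit per letter compared with the Shannon lengths.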

We shall illustrate these procedures with the example in Figure 3.3.

A more efficient procedure is also due to Shannon and Fano. The **Shannon-Fano-algorithm** will be illustrated by the same example in Figure 3.4.

The messages are first written in order of nonincreasing probabilities. Then the message set is partitioned into two most equiprobable subsets $\mathcal{X}_0$ and $\mathcal{X}_1$. A $0$ is assigned to each message contained in one subset and a $1$ to each of the remaining messages. The same procedure is repeated for the subsets $\mathcal{X}_0$ and $\mathcal{X}_1$; that is, $\mathcal{X}_0$ will be partitioned into two subsets $\mathcal{X}_{00}$ and $\mathcal{X}_{01}$. Now the codeword corresponding to a message contained in $\mathcal{X}_{00}$ will start with $00$ and that corresponding to a message in $\mathcal{X}_{01}$ will begin with $01$. This procedure is continued until each subset contains only one message.
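The partitioning procedure can be sketched recursively; the five-letter distribution in the usage example is hypothetical:

```python
def shannon_fano(items):
    """Recursive Shannon-Fano partitioning.  `items` is a list of
    (letter, probability) pairs sorted by nonincreasing probability.
    The set is split into two parts whose total probabilities are as
    equal as possible; one part gets prefix '0', the other prefix '1'."""
    if len(items) == 1:
        return {items[0][0]: ""}
    total, acc, split, best = sum(p for _, p in items), 0.0, 1, float("inf")
    for i in range(1, len(items)):       # find the most balanced split point
        acc += items[i - 1][1]
        if abs(total - 2 * acc) < best:
            best, split = abs(total - 2 * acc), i
    code = {}
    for letter, cw in shannon_fano(items[:split]).items():
        code[letter] = "0" + cw
    for letter, cw in shannon_fano(items[split:]).items():
        code[letter] = "1" + cw
    return code

items = [("a", 0.35), ("b", 0.17), ("c", 0.17), ("d", 0.16), ("e", 0.15)]
print(shannon_fano(items))
# {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
```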

However, this algorithm does not yield an optimal code in general either; there exist sources for which a different choice of prefix code achieves a strictly smaller average length.

The **Huffman coding algorithm** is a recursive procedure, which we shall illustrate with the same example as for the Shannon-Fano-algorithm in Figure 3.5. The source is successively reduced by one element. In each reduction step we add up the two smallest probabilities $P(m-1)$ and $P(m)$ and insert their sum $P(m-1) + P(m)$ into the ordered sequence of the remaining probabilities, thus obtaining a new probability distribution $P'$ with $P'(1) \ge \dots \ge P'(m-1)$. Finally we arrive at a source with two elements ordered according to their probabilities. The first element is assigned a $0$, the second element a $1$. Now we again “blow up” the source until the original source is restored. In each step the codewords $c(m-1)$ and $c(m)$ are obtained by appending $0$ or $1$, respectively, to the codeword corresponding to the merged probability $P(m-1) + P(m)$.
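A compact sketch of the Huffman algorithm, using a priority queue instead of the ordered sequence described above; the five-letter distribution is a hypothetical example:

```python
import heapq
from itertools import count

def huffman(P):
    """Huffman coding: repeatedly merge the two least probable elements;
    on the way back, the two halves receive an extra leading '0' or '1'."""
    tiebreak = count()   # avoids comparing dicts when probabilities are equal
    heap = [(p, next(tiebreak), {letter: ""}) for letter, p in P.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # smallest probability
        p2, _, c2 = heapq.heappop(heap)   # second smallest
        merged = {letter: "0" + cw for letter, cw in c1.items()}
        merged.update({letter: "1" + cw for letter, cw in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

P = {"a": 0.35, "b": 0.17, "c": 0.17, "d": 0.16, "e": 0.15}
code = huffman(P)
print(sorted(code[x] for x in P))   # ['0', '100', '101', '110', '111']
```

The priority queue replaces the explicit re-insertion into a sorted sequence but performs exactly the same merges.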

The following theorem demonstrates that the Huffman coding algorithm always yields a prefix code optimal with respect to the average codeword length.

**Theorem 3.3 **
*We are given a source $(\mathcal{X}, P)$, where $\mathcal{X} = \{1, \dots, m\}$ and the probabilities are ordered non-increasingly $P(1) \ge P(2) \ge \dots \ge P(m)$. A new probability distribution $P'$ is defined by*

$$P'(x) = \begin{cases} P(x) & \text{for } x = 1, \dots, m-2 ,\\ P(m-1) + P(m) & \text{for } x = m-1 .\end{cases}$$

Let $c' = \big( c'(1), \dots, c'(m-1) \big)$ be an optimal prefix code for $P'$. Now we define a code $c$ for the distribution $P$ by

$$c(x) = c'(x) \ \text{for}\ x = 1, \dots, m-2, \qquad c(m-1) = c'(m-1)0, \qquad c(m) = c'(m-1)1 .$$

Then $c$ is an optimal prefix code for $P$ and $L_{\mathrm{opt}}(P) - L_{\mathrm{opt}}(P') = P(m-1) + P(m)$, where $L_{\mathrm{opt}}(P)$ denotes the length of an optimal prefix code for probability distribution $P$.

**Proof. **For a probability distribution $P$ on $\mathcal{X} = \{1, \dots, m\}$ with $P(1) \ge P(2) \ge \dots \ge P(m)$ there exists an optimal prefix code $c$ with

i) $L(1) \le L(2) \le \dots \le L(m)$;

ii) $L(m-1) = L(m)$;

iii) $c(m-1)$ and $c(m)$ differ exactly in the last position.

This holds, since:

i) Assume that there are $x, y \in \mathcal{X}$ with $P(x) \ge P(y)$ and $L(x) > L(y)$. Then the code $d$ obtained by interchanging the codewords $c(x)$ and $c(y)$ has average length $L(d) \le L(c)$, since

$$L(d) - L(c) = P(x) \cdot L(y) + P(y) \cdot L(x) - P(x) \cdot L(x) - P(y) \cdot L(y) = \big( P(x) - P(y) \big) \cdot \big( L(y) - L(x) \big) \le 0 .$$

ii) Assume we are given a code with $L(m-1) < L(m)$. Because of the prefix property we may drop the last $L(m) - L(m-1)$ bits of $c(m)$ and thus obtain a new code with $L(m) = L(m-1)$.

iii) If no two codewords of maximal length agree in all places but the last, then we may drop the last digit of all such codewords to obtain a better code.

Now we are ready to prove the statement from the theorem. From the definition of $c$ and $c'$ we have

$$L_{\mathrm{opt}}(P) \le L(c) = L(c') + P(m-1) + P(m) .$$
Now let $c''$ be an optimal prefix code for $P$ with the properties ii) and iii) from the preceding lemma. We define a prefix code $c'''$ for $P'$ by $c'''(x) = c''(x)$ for $x = 1, \dots, m-2$, and $c'''(m-1)$ is obtained by dropping the last bit of $c''(m-1)$ or $c''(m)$.

Now

$$L_{\mathrm{opt}}(P) = L(c'') = L(c''') + P(m-1) + P(m) \ge L_{\mathrm{opt}}(P') + P(m-1) + P(m)$$

and hence $L_{\mathrm{opt}}(P) - L_{\mathrm{opt}}(P') = P(m-1) + P(m)$, since $L(c') = L_{\mathrm{opt}}(P')$. In particular $L(c) = L_{\mathrm{opt}}(P)$, so $c$ is an optimal prefix code for $P$.

If $m$ denotes the size of the source alphabet, the Huffman coding algorithm needs $m-1$ additions and $m-1$ code modifications (appending $0$ or $1$). Further we need $m-1$ insertions into the ordered sequence of probabilities, such that the total complexity can be roughly estimated to be $O(m \log m)$. However, observe that by the Noiseless Coding Theorem, the quality of the compression rate can only be improved by jointly encoding blocks of, say, $n$ letters, which would result in a Huffman code for the source $(\mathcal{X}^n, P^n)$ of size $m^n$. So, the price for better compression is a rather drastic increase in complexity. Further, the codewords for all $m^n$ blocks have to be stored. Encoding a sequence of letters can then be done in a number of steps linear in its length.

**Exercises**

3.1-1 Show that the code $c$ with $c(a) = 0$ and $c(b) = \underbrace{0 \cdots 0}_{n} 1$ is uniquely decipherable but not instantaneous for any $n \ge 1$.

3.1-2 Compute the entropy of the source $(\mathcal{X}, P)$, with $\mathcal{X} = \{a, b\}$ and $P = (p, 1-p)$ for some $0 < p < 1$.

3.1-3 Find the Huffman-codes and the Shannon-Fano-codes for the sources $(\mathcal{X}^n, P^n)$, with $(\mathcal{X}, P)$ as in the previous exercise for a fixed $p$, for $n = 1, 2, 3$, and calculate their average codeword lengths.

3.1-4 Show that always $0 \le H(P) \le \log |\mathcal{X}|$.

3.1-5 Show that the redundancy $\rho(c) = L(c) - H(P)$ of a prefix code $c$ for a source with probability distribution $P$ can be expressed as a special I-divergence.

3.1-6 Show that the I-divergence $D(P\|Q) \ge 0$ for all probability distributions $P$ and $Q$ over some alphabet $\mathcal{X}$, with equality exactly if $P = Q$, but that the I-divergence is not a metric.