## 3.2. 3.2 Arithmetic coding and modelling

In statistical coding techniques as Shannon-Fano- or Huffman-coding the probability distribution of the source is modelled as accurately as possible and then the words are encoded such that a higher probability results in a shorter codeword length.

We know that Huffman-codes are optimal with respect to the average codeword length. However, the entropy is approached by increasing the block length. On the other hand, for long blocks of source symbols, Huffman-coding is a rather complex procedure, since it requires the calculation of the probabilities of all sequences of the given block length and the construction of the corresponding complete code.

For compression techniques based on statistical methods often arithmetic coding is preferred. Arithmetic coding is a straightforward extension of the Shannon-Fano-Elias-code. The idea is to represent a probability by an interval. In order to do so, the probabilities have to be calculated very accurately. This process is denoted as modelling of the source. So statistical compression techniques consist of two stages: modelling and coding. As just mentioned, coding is usually done by arithmetic coding. The different algorithms like, for instance, DCM (Discrete Markov Coding) and PPM (Prediction by Partial Matching) vary in the way of modelling the source. We are going to present the context-tree weighting method, a transparent algorithm for the estimation of block probabilities due to Willems, Shtarkov, and Tjalkens, which also allows a straightforward analysis of the efficiency.

### 3.2.1. 3.2.1 Arithmetic coding

The idea behind arithmetic coding is to represent a message by interval , where is the sum of the probabilities of those sequences which are smaller than in lexicographic order.

A codeword assigned to message also corresponds to an interval. Namely, we identify codeword of length with interval , where is the binary expansion of the nominator in the fraction . The special choice of codeword will be obtained from and as follows:

So message is encoded by a codeword , whose interval is inside interval .

Let us illustrate arithmetic coding by the following example of a discrete memoryless source with and .

At first glance it may seem that this code is much worse than the Huffman code for the same source with codeword lengths () we found previously. On the other hand, it can be shown that arithmetic coding always achieves an average codeword length , which is only two bits apart from the lower bound in the noiseless coding theorem. Huffman coding would usually yield an even better code. However, this “negligible” loss in compression rate is compensated by several advantages. The codeword is directly computed from the source sequence, which means that we do not have to store the code as in the case of Huffman coding. Further, the relevant source models allow to easily compute and from , usually by multiplication by . This means that the sequence to be encoded can be parsed sequentially bit by bit, unlike in Huffman coding, where we would have to encode blockwise.

#### 3.2.1.1.  Encoding.

The basic algorithm for encoding a sequence by arithmetic coding works as follows. We assume that , (in the case for all the discrete memoryless source arises, but in the section on modelling more complicated formulae come into play) and hence

Starting with and the first letters of the text to be compressed determine the current interval . These current intervals are successively refined via the recursions

is usually denoted as augend. The final interval will then be encoded by interval as described above. So the algorithm looks as follows.

`Arithmetic-Encoder(` `)`

```  1     2     3
`FOR`

`TO`

4
`DO`

5          6     7     8
`RETURN`

```

We shall illustrate the encoding procedure by the following example from the literature. Let the discrete, memoryless source be given with ternary alphabet and , , . The sequence has to be encoded. Observe that and for all . Further , , and .

The above algorithm yields

Hence and . From this can be calculated that and finally whose binary representation is codeword .

#### 3.2.1.2.  Decoding.

Decoding is very similar to encoding. The decoder recursively “undoes” the encoder's recursion. We divide the interval into subintervals with bounds defined by . Then we find the interval in which codeword can be found. This interval determines the next symbol. Then we subtract and rescale by multiplication by .

`Arithmetic-Decoder(` `)`

```  1
`FOR`

`TO`

2
`DO`

3
`WHILE`

4
`DO`

5                6                7
`RETURN`

```

Observe that when the decoder only receives codeword he does not know when the decoding procedure terminates. For instance can be the codeword for , , , etc. In the above pseudocode it is implicit that the number of symbols has also been transmitted to the decoder, in which case it is clear what the last letter to be encoded was. Another possibility would be to provide a special end-of-file (EOF)-symbol with a small probability, which is known to both the encoder and the decoder. When the decoder sees this symbol, he stops decoding. In this case line 1 would be replaced by

```  1
`WHILE`
(EOF) ```

(and would have to be increased). In our above example, the decoder would receive the codeword , the binary expansion of up to bits. This number falls in the interval which belongs to the letter , hence the first letter . Then he calculates . Again this number is in the interval and the second letter is . In order to determine the calculation must be performed. Again such that also . Finally . Since , the last letter of the sequence must be .

#### 3.2.1.3.  Correctness.

Recall that message is encoded by a codeword , whose interval is inside interval . This follows from .

Obviously a prefix code is obtained, since a codeword can only be a prefix of another one, if their corresponding intervals overlap – and the intervals are obviously disjoint for different -s.

Further, we mentioned already that arithmetic coding compresses down to the entropy up to two bits. This is because for every sequence it is . It can also be shown that the additional transmission of block length or the introduction of the EOF symbol only results in a negligible loss of compression.

However, the basic algorithms we presented are not useful in order to compress longer files, since with increasing block length the intervals are getting smaller and smaller, such that rounding errors will be unavoidable. We shall present a technique to overcome this problem in the following.

#### 3.2.1.4.  Analysis.

The basic algorithm for arithmetic coding is linear in the length of the sequence to be encoded. Usually, arithmetic coding is compared to Huffman coding. In contrast to Huffman coding, we do not have to store the whole code, but can obtain the codeword directly from the corresponding interval. However, for a discrete memoryless source, where the probability distribution is the same for all letters, this is not such a big advantage, since the Huffman code will be the same for all letters (or blocks of letters) and hence has to be computed only once. Huffman coding, on the other hand, does not use any multiplications which slow down arithmetic coding.

For the adaptive case, in which the 's may change for different letters to be encoded, a new Huffman code would have to be calculated for each new letter. In this case, usually arithmetic coding is preferred. We shall investigate such situations in the section on modelling. For implementations in practice floating point arithmetic is avoided. Instead, the subdivision of the interval is represented by a subdivision of the integer range , say, with proportions according to the source probabilities. Now integer arithmetic can be applied, which is faster and more precise.

#### 3.2.1.5.  Precision problem.

In the basic algorithms for arithmetic encoding and decoding the shrinking of the current interval would require the use of high precision arithmetic for longer sequences. Further, no bit of the codeword is produced until the complete sequence has been read in. This can be overcome by coding each bit as soon as it is known and then double the length of the current interval , say, so that this expansion represents only the unknown part of the interval. This is the case when the leading bits of the lower and upper bound are the same, i. e. the interval is completely contained either in or in . The following expansion rules guarantee that the current interval does not become too small.

Case 1 (): , .

Case 2 (): , .

Case 3 (): , .

The last case called underflow (or follow) prevents the interval from shrinking too much when the bounds are close to . Observe that if the current interval is contained in with , we do not know the next output bit, but we do know that whatever it is, the following bit will have the opposite value. However, in contrast to the other cases we cannot continue encoding here, but have to wait (remain in the underflow state and adjust a counter to the number of subsequent underflows, i. e. ) until the current interval falls into either or . In this case we encode the leading bit of this interval – for and for – followed by many inverse bits and then set . The procedure stops, when all letters are read in and the current interval does not allow any further expansion.

`Arithmetic-Precision-Encoder(` `)`

```  1     2     3     4     5
`FOR`

`TO`

6
`DO`

7          8          9
`WHILE`

AND NOT ( AND )  10
`DO`

`IF`

11
`THEN`

many s  12                   13                   14                   15
`ELSE`

`IF`

16
`THEN`

,  many 0s  17                      18                      19                     20
`ELSE IF`

AND   21
`THEN`

22                      23                  24
`IF`

25
`THEN`

,  many 1s)  26
`RETURN`

```

We shall illustrate the encoding algorithm in Figure3.6 by our example – the encoding of the message with alphabet and probability distribution . An underflow occurs in the sixth row: we keep track of the underflow state and later encode the inverse of the next bit, here this inverse bit is the in the ninth row. The encoded string is .

Precision-decoding involves the consideration of a third variable besides the interval bounds LO and HI.

### 3.2.2. 3.2.2 Modelling

#### 3.2.2.1.  Modelling of memoryless sources.

In this section we shall only consider binary sequences to be compressed by an arithmetic coder. Further, we shortly write instead of in order to allow further subscripts and superscripts for the description of the special situation. will denote estimated probabilities, weighted probabilities, and probabilities assigned to a special context .

The application of arithmetic coding is quite appropriate if the probability distribution of the source is such that can easily be calculated from . Obviously this is the case, when the source is discrete and memoryless, since then .

Even when the underlying parameter of a binary, discrete memoryless source is not known, there is an efficient way due to Krichevsky and Trofimov to estimate the probabilities via

where and denote the number of s and s, respectively, in the sequence . So given the sequence with many s and many s, the probability that the next letter will be a is estimated as . The estimated block probability of a sequence containing zeros and ones then is

with initial values and as in Figure 3.7, where the values of the Krichevsky-Trofimov–estimator for small are listed.

Note that the summand in the nominator guarantees that the probability for the next letter to be a is positive even when the symbol did not occur in the sequence so far. In order to avoid infinite codeword length, this phenomenon has to be carefully taken into account when estimating the probability of the next letter in all approaches to estimate the parameters, when arithmetic coding is applied.

#### 3.2.2.2.  Modells with known context tree.

In most situations the source is not memoryless, i. e., the dependencies between the letters have to be considered. A suitable way to represent such dependencies is the use of a suffix tree, which we denote as context tree. The context of symbol is suffix preceding . To each context (or leaf in the suffix tree) there corresponds a parameter , which is the probability of the occurrence of a 1 when the last sequence of past source symbols is equal to context (and hence is the probability for a in this case). We are distinguishing here between the model (the suffix tree) and the parameters ().

Example 3.1 Let and , , and . The corresponding suffix tree jointly with the parsing process for a special sequence can be seen in Figure 3.8.

The actual probability of the sequence ' ' given the past ' ' is , since the first letter is preceded by suffix , the second letter is preceded by suffix , etc.

Suppose the model is known, but not the parameters . The problem now is to find a good coding distribution for this case. The tree structure allows to easily determine which context precedes a particular symbol. All symbols having the same context (or suffix) form a memoryless source subsequence whose probability is determined by the unknown parameter . In our example these subsequences are ' ' for , ' ' for and ' ' for . One uses the Krichevsky-Trofimov-estimator for this case. To each node in the suffix tree, we count the numbers of zeros and of ones preceded by suffix . For the children and of parent node obviously and must be satisfied.

In our example for the root , and , . Further , , , , , , , , , and . These last numbers are not relevant for our special source but will be important later on, when the source model or the corresponding suffix tree, respectively, is not known in advance.

Example 3.2 Let as in the previous example. Encoding a subsequence is done by successively updating the corresponding counters for and . For example, when we encode the sequence ' ' given the past ' ' using the above suffix tree and Krichevsky-Trofimov-estimator we obtain

where , and are the probabilities of the subsequences ' ', ' ' and ' ' in the context of the leaves. These subsequences are assumed to be memoryless.

#### 3.2.2.3.  The context-tree weighting method.

Suppose we have a good coding distribution for source 1 and another one, , for source 2. We are looking for a good coding distribution for both sources. One possibility is to compute and and then 1 bit is needed to identify the best model which then will be used to compress the sequence. This method is called selecting. Another possibility is to employ the weighted distribution, which is

We shall present now the context-tree weighting algorithm. Under the assumption that a context tree is a full tree of depth , only and , i. e. the number of zeros and ones in the subsequence of bits preceded by context , are stored in each node of the context tree.

Further, to each node is assigned a weighted probability which is recursively defined as

where describes the length of the (binary) string and is the estimated probability using the Krichevsky-Trofimov-estimator.

Example 3.3 After encoding the sequence ' ' given the past ' ' we obtain the context tree of depth 3 in Figure 3.9. The weighted probability of the root node finally yields the coding probability corresponding to the parsed sequence.

Recall that for the application in arithmetic coding it is important that probabilities and can be efficiently calculated from . This is possible with the context-tree weighting method, since the weighted probabilities only have to be updated, when is changing. This just occurs for the contexts along the path from the root to the leaf in the context tree preceding the new symbol —namely the contexts for and the root . Along this path, has to be performed, when , and has to be performed, when , and the corresponding probabilities and have to be updated.

This suggests the following algorithm for updating the context tree when reading the next letter . Recall that to each node of the tree we store the parameters , and . These parameters have to be updated in order to obtain . We assume the convention that the ordered pair denotes the root .

`Update-Context-Tree(` `)`

```  1     2
`IF`

3
`THEN`

4          5
`ELSE`

6          7
`FOR`

`TO`

8
`DO`

9
`IF`

10
`THEN`

11               12
`ELSE`

13               14      15
`RETURN`

```

The probability assigned to the root in the context tree will be used for the successive subdivisions in arithmetic coding. Initially, before reading , the parameters in the context tree are , , and for all contexts in the tree. In our example the updates given the past would yield the successive probabilities : for , for , for , for , for , for , for , and finally for .

#### 3.2.2.4.  Correctness.

Recall that the quality of a code concerning its compression capability is measured with respect to the average codeword length. The average codeword length of the best code comes as close as possible to the entropy of the source. The difference between the average codeword length and the entropy is denoted as the redundancy of code , hence

which obviously is the weighted (by ) sum of the individual redundancies

The individual redundancy of sequences given the (known) source for all for , is bounded by

The individual redundancy of sequences using the context–tree weighting algorithm (and hence a complete tree of all possible contexts as model ) is bounded by

Comparing these two formulae, we see that the difference of the individual redundancies is bits. This can be considered as the cost of not knowing the model, i.e. the model redundancy. So, the redundancy splits into the parameter redundancy, i. e. the cost of not knowing the parameter, and the model redundancy. It can be shown that the expected redundancy behaviour of the context-tree weighting method achieves the asymptotic lower bound due to Rissanen who could demonstrate that about bits per parameter is the minimum possible expected redundancy for .

#### 3.2.2.5.  Analysis.

The computational complexity is proportional to the number of nodes that are visited when updating the tree, which is about . Therefore, the number of operations necessary for processing symbols is linear in . However, these operations are mainly multiplications with factors requiring high precision.

As for most modelling algorithms, the backlog of implementations in practice is the huge amount of memory. A complete tree of depth has to be stored and updated. Only with increasing the estimations of the probabilities are becoming more accurate and hence the average codeword length of an arithmetic code based on these estimations would become shorter. The size of the memory, however, depends exponentially on the depth of the tree.

We presented the context--tree weighting method only for binary sequences. Note that in this case the cumulative probability of a binary sequence can be calculated as

For compression of sources with larger alphabets, for instance ASCII-files, we refer to the literature.

Exercises

3.2-1 Compute the arithmetic codes for the sources , with and and compare these codes with the corresponding Huffman-codes derived previously.

3.2-2 For the codes derived in the previous exercise compute the individual redundancies of each codeword and the redundancies of the codes.

3.2-3 Compute the estimated probabilities for the sequence and all its subsequences using the Krichevsky-Trofimov-estimator.

3.2-4 Compute all parameters and the estimated probability for the sequence given the past , when the context tree is known. What will be the codeword of an arithmetic code in this case?

3.2-5 Compute all parameters and the estimated probability for the sequence given the past , when the context tree is not known, using the context-tree weighting algorithm.

3.2-6 Based on the computations from the previous exercise, update the estimated probability for the sequence given the past .

Show that for the cumulative probability of a binary sequence it is