3.4. 3.4 The Burrows-Wheeler-transform

The Burrows-Wheeler-transform will best be demonstrated by an example. Assume that our original text is “WHEELER”. This text will be mapped to a second text and an index according to the following rules.

1) We form a matrix consisting of all cyclic shifts of the original text . In our example

2) From we obtain a new matrix by simply ordering the rows in lexicographically. Here this yields the matrix

3) The transformed string then is just the last column of the matrix and the index is the number of the row of , in which the original text is contained. In our example “HELWEER” and – we start counting the the rows with row no. .

This gives rise to the following pseudocode. We write here instead of and instead of , since the purpose of the vector notation is only to distinguish the vectors from the letters in the text.

BWT-Encoder( )

  1  
                     FOR
                   
                   
                  
                     TO
                   
                     2    
                     DO
                   
                     3  
                     FOR
                   
                   
                  
                     TO
                   
                     4    
                     DO
                   
                  
                     FOR
                   
                   
                  
                     TO
                   
                     5          
                     DO
                   
                     6  
                     FOR
                   
                   
                  
                     TO
                   
                     7    
                     DO
                   row  of  row  of  in lexicographic order   8  
                     FOR
                   
                   
                  
                     TO
                   
                     9    
                     DO
                   
                    10     11  
                     WHILE
                   (row  of  row  of )  12    
                     DO
                   
                    13     14  
                     RETURN
                   
                   and  
               

It can be shown that this transformation is invertible, i. e., it is possible to reconstruct the original text from its transform and the index . This is because these two parameters just yield enough information to find out the underlying permutation of the letters. Let us illustrate this reconstruction using the above example again. From the transformed string we obtain a second string by simply ordering the letters in in ascending order. Actually, is the first column of the matrix above. So, in our example

Now obviously the first letter of our original text is the letter in position of the sorted string , so here . Then we look at the position of the letter just considered in the string – here there is only one W, which is letter no. 3 in . This position gives us the location of the next letter of the original text, namely . is found in position no. 0 in , hence . Now there are three E–s in the string and we take the first one not used so far, here the one in position no. 1, and hence . We iterate this procedure and find , , .

This suggests the following pseudocode.

BWT-Decoder( )

  1   sort    2     3  
                     FOR
                   
                   
                  
                     TO
                   
                     4    
                     DO
                   
                     5       
                     WHILE
                   ( OR  is a component of )   6          
                     DO
                   
                     7                 8                9  
                     RETURN
                   
                   
               

This algorithm implies a more formal description. Since the decoder only knows , he has to sort this string to find out . To each letter from the transformed string record the position in from which it was jumped to by the process described above. So the vector in our pseudocode yields a permutation such that for each row it is in matrix . In our example . This permutation can be used to reconstruct the original text of length via , where and for .

Observe that so far the original data have only been transformed and are not compressed, since string has exactly the same length as the original string . So what is the advantage of the Burrows-Wheeler transformation? The idea is that the transformed string can be much more efficiently encoded than the original string. The dependencies among the letters have the effect that in the transformed string there appear long blocks consisting of the same letter.

In order to exploit such frequent blocks of the same letter, Burrows and Wheeler suggested the following move-to-front-code, which we shall illustrate again with our example above.

We write down a list containing the letters used in our text in alphabetic order indexed by their position in this list.

Then we parse through the transformed string letter by letter, note the index of the next letter and move this letter to the front of the list. So in the first step we note 1—the index of the H, move H to the front and obtain the list

Then we note 1 and move E to the front,

note 2 and move L to the front,

note 4 and move W to the front,

note 2 and move E to the front,

note 0 and leave E at the front,

note 4 and move R to the front,

So we obtain the sequence as our move-to-front-code. The pseudocode may look as follows, where is a list of the letters occuring in the string .

Move-To-Front( )

  1   list of  letters occuring in  ordered alphabetically    2  
                     FOR
                   
                   
                  
                     TO
                   
                     3    
                     DO
                   
                      4       
                     WHILE
                   ()   5             6          7  
                     FOR
                   
                   
                  
                     TO
                   
                     8    
                     DO
                   
                     9  
                     RETURN
                   
                   
               

The move-to-front-code will finally be compressed, for instance by Huffman-coding.

Correctness.

The compression is due to the move-to-front-code obtained from the transformed string . It can easily be seen that this move-to-front coding procedure is invertible, so one can recover the string from the code obtained as above.

Now it can be observed that in the move-to-front-code small numbers occur more frequently. Unfortunately, this will become obvious only with much longer texts than in our example—in long strings it was observed that even about 70 per cent of the numbers are . This irregularity in distribution can be exploited by compressing the sequence obtained after move-to-front-coding, for instance by Huffman-codes or run-length codes.

The algorithm performed very well in practice regarding the compression rate as well as the speed. The asymptotic optimality of compression has been proven for a wide class of sources.

Analysis.

The most complex part of the Burrows-Wheeler transform is the sorting of the block yielding the transformed string . Due to fast sorting procedures, especially suited for the type of data to be compressed, compression algorithms based on the Burrows-Wheeler transform are usually very fast. On the other hand, compression is done blockwise. The text to be compressed has to be divided into blocks of appropriate size such that the matrices and still fit into the memory. So the decoder has to wait until the whole next block is transmitted and cannot work sequentially bit by bit as in arithmetic coding or Ziv-Lempel coding.

Exercises

3.4-1 Apply the Burrows-Wheeler-transform and the move-to-front code to the text “abracadabra”.

3.4-2 Verify that the transformed string and the index of the position in the sorted text (containing the first letter of the original text to be compressed) indeed yield enough information to reconstruct the original text.

3.4-3 Show how in our example the decoder would obtain the string ”HELWEER” from the move-to-front code and the letters E, H, L, W, R occuring in the text. Describe the general procedure for decoding move-to-front codes.

3.4-4 We followed here the encoding procedure presented by Burrows and Wheeler. Can the encoder obtain the transformed string even without constructing the two matrices and ?