In 1976–1978 Jacob Ziv and Abraham Lempel introduced two universal coding algorithms, which in contrast to statistical coding techniques, considered so far, do not make explicit use of the underlying probability distribution. The basic idea here is to replace a previously seen string with a pointer into a history buffer (LZ77) or with the index of a dictionary (LZ78). LZ algorithms are widely used—”zip” and its variations use the LZ77 algorithm. So, in contrast to the presentation by several authors, Ziv-Lempel-coding is not a single algorithm. Originally, Lempel and Ziv introduced a method to measure the complexity of a string—like in Kolmogorov complexity. This led to two different algorithms, LZ77 and LZ78. Many modifications and variations have been developed since. However, we shall present the original algorithms and refer to the literature for further information.
The idea of LZ77 is to pass a sliding window over the text to be compressed. One looks for the longest substring in this window representing the next letters of the text. The window consists of two parts: a history window of length , say, in which the last bits of the text considered so far are stored, and a lookahead window of length containing the next bits of the text. In the simplest case and are fixed. Usually, is much bigger than . Then one encodes the triple (offset, length, letter). Here the offset is the number of letters one has to go back in the text to find the matching substring, the length is just the length of this matching substring, and the letter to be stored is the letter following the matching substring. Let us illustrate this procedure with an example. Assume the text to be compressed is , the window is of size 15 with letters history and letters lookahead buffer. Assume, the sliding window now arrived at
i. e., the history window contains the 10 letters and the lookahead window contains the five letters . The longest substring matching the first letters of the lookahead window is of length , which is found nine letters back from the right end of the history window. So we encode , since is the next letter (the string is also found five letters back, in the original LZ77 algorithm one would select the largest offset). The window then is moved letters forward
The next codeword is , since the longest matching substring is of length found letters backwards and is the letter following this substring in the lookahead window. We proceed with
and encode . Further
Here we encode . Observe that the match can extend into the lookahead window.
There are many subtleties to be taken into account. If a symbol did not appear yet in the text, offset and length are set to . If there are two matching strings of the same length, one has to choose between the first and the second offset. Both variations have advantages. Initially one might start with an empty history window and the first letters of the text to be compressed in the lookahead window - there are also further variations.
A common modification of the original scheme is to output only the pair (offset, length) and not the following letter of the text. Using this coding procedure one has to take into consideration the case in which the next letter does not occur in the history window. In this case, usually the letter itself is stored, such that the decoder has to distinguish between pairs of numbers and single letters. Further variations do not necessarily encode the longest matching substring.
LZ78 does not use a sliding window but a dictionary which is represented here as a table with an index and an entry. LZ78 parses the text to be compressed into a collection of strings, where each string is the longest matching string seen so far plus the symbol following in the text to be compressed. The new string is added into the dictionary. The new entry is coded as , where is the index of the existing table entry and is the appended symbol.
As an example, consider the string “ ”. It is divided by LZ78 into strings as shown below. String is here the empty string.
Since we are not using a sliding window, there is no limit for how far back strings can reach. However, in practice the dictionary cannot continue to grow infinitely. There are several ways to manage this problem. For instance, after having reached the maximum number of entries in the dictionary, no further entries can be added to the table and coding becomes static. Another variation would be to replace older entries. The decoder knows how many bits must be reserved for the index of the string in the dictionary, and hence decompression is straightforward.
Ziv-Lempel coding asymptotically achieves the best possible compression rate which again is the entropy rate of the source. The source model, however, is much more general than the discrete memoryless source. The stochastic process generating the next letter, is assumed to be stationary (the probability of a sequence does not depend on the instant of time, i. e. for all and all sequences ). For stationary processes the limit exists and is defined to be the entropy rate.
If denotes the number of strings in the parsing process of LZ78 for a text generated by a stationary source, then the number of bits required to encode all these strings is . It can be shown that converges to the entropy rate of the source. However, this would require that all strings can be stored in the dictionary.
If we fix the size of the sliding window or the dictionary, the running time of encoding a sequence of letters will be linear in . However, as usually in data compression, there is a tradeoff between compression rate and speed. A better compression is only possible with larger memory. Increasing the size of the dictionary or the window will, however, result in a slower performance, since the most time consuming task is the search for the matching substring or the position in the dictionary.
Decoding in both LZ77 and LZ78 is straightforward. Observe that with LZ77 decoding is usually much faster than encoding, since the decoder already obtains the information at which position in the history he can read out the next letters of the text to be recovered, whereas the encoder has to find the longest matching substring in the history window. So algorithms based on LZ77 are useful for files which are compressed once and decompressed more frequently.
Further, the encoded text is not necessarily shorter than the original text. Especially in the beginning of the encoding the coded version may expand a lot. This expansion has to be taken into consideration.
For implementation it is not optimal to represent the text as an array. A suitable data structure will be a circular queue for the lookahead window and a binary search tree for the history window in LZ77, while for LZ78 a dictionary tree should be used.
3.3-1 Apply the algorithms LZ77 and LZ78 to the string “abracadabra”.
3.3-2 Which type of files will be well compressed with LZ77 and LZ78, respectively? For which type of files are LZ77 and LZ78 not so advantageous?
3.3-3 Discuss the advantages of encoding the first or the last offset, when several matching substrings are found in LZ77.