A technique related to Huffman coding is *Shannon-Fano coding*, which works as follows:
The technique of *arithmetic coding* does not have this restriction: It achieves the same effect as treating the message as one single unit (a technique which would, for Huffman coding, require enumeration of every single possible message), and thus attains the theoretical entropy bound to compression efficiency for any source.
Arithmetic coding works by representing a number by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller and smaller, and the number of bits needed to specify that interval increases. Successive symbols in the message reduce this interval in accordance with the probability of that symbol. The more likely symbols reduce the range by less, and thus add fewer bits to the message.
1 Codewords +-----------+-----------+-----------+ /-----\ | |8/9 YY | Detail |<- 31/32 .11111 | +-----------+-----------+<- 15/16 .1111 | Y | | too small |<- 14/16 .1110 |2/3 | YX | for text |<- 6/8 .110 +-----------+-----------+-----------+ | | |16/27 XYY |<- 10/16 .1010 | | +-----------+ | | XY | | | | | XYX |<- 4/8 .100 | |4/9 | | | +-----------+-----------+ | | | | | X | | XXY |<- 3/8 .011 | | |8/27 | | | +-----------+ | | XX | | | | | |<- 1/4 .01 | | | XXX | | | | | |0 | | | +-----------+-----------+-----------+As an example of arithmetic coding, lets consider the example of two symbols X and Y, of probabilities 0.66 and 0.33. To encode this message, we examine the first symbol: If it is a X, we choose the lower partition; if it is a Y, we choose the upper partition. Continuing in this manner for three symbols, we get the codewords shown to the right of the diagram above - they can be found by simply taking an appropriate location in the interval for that particular set of symbols and turning it into a binary fraction. In practice, it is also necessary to add a special end-of-data symbol, which is not represented in this simpe example. In this case the arithmetic code is not completely efficient, which is due to the shortness of the message - with longer messages the coding efficiency does indeed approach 100%.
Now that we have an efficient encoding technique, what can we do with it? What we need is a technique for building a model of the data which we can then use with the encoder. The simplest model is a fixed one, for example a table of standard letter frequencies for English text which we can then use to get letter probabilities. An improvement on this technique is to use an *adaptive model*, in other words a model which adjusts itself to the data which is being compressed as the data is compressed. We can convert the fixed model into an adaptive one by adjusting the symbol frequencies after each new symbol is encoded, allowing the model to track the data being transmitted. However, we can do much better than that.
Using the symbol probabilities by themselves is not a particularly good estimate of the true entropy of the data: We can take into account intersymbol probabilities as well. The best compressors available today take this approach: DMC (Dynamic Markov Coding) starts with a zero-order Markov model and gradually extends this initial model as compression progresses; PPM (Prediction by Partial Matching) looks for a match of the text to be compressed in an order-n context. If no match is found, it drops to an order n-1 context, until it reaches order 0. Both these techniques thus obtain a much better model of the data to be compressed, which, combined with the use of arithmetic coding, results in superior compression performance.
So if arithmetic coding-based compressors are so powerful, why are they not used universally? Apart from the fact that they are relatively new and haven't come into general use too much yet, there is also one major concern: The fact that they consume rather large amounts of computing resources, both in terms of CPU power and memory. The building of sophisticated models for the compression can chew through a fair amount of memory (especially in the case of DMC, where the model can grow without bounds); and the arithmetic coding itself involves a fair amount of number crunching. There is however an alternative approach, a class of compressors generally referred to as *substitutional* or *dictionary-based compressors*.
Input string: /WED/WE/WEE/WEB Character input: Code output: New code value and associated string: /W / 256 = /W E W 257 = WE D E 258 = ED / D 259 = D/ WE 256 260 = /WE / E 261 = E/ WEE 260 262 = /WEE /W 261 263 = E/W EB 257 264 = WEB *END* BLZW starts with a 4K dictionary, of which entries 0-255 refer to individual bytes, and entries 256-4095 refer to substrings. Each time a new code is generated it means a new string has been parsed. New strings are generated by appending the current character K to the end of an existing string w. The algorithm for LZW compression is as follows:
set w = NIL loop read a character K if wK exists in the dictionary w = wK else output the code for w add wK to the string table w = K endloopA sample run of LZW over a (highly redundant) input string can be seen in the diagram above. The strings are built up character-by-character starting with a code value of 256. LZW decompression takes the stream of codes and uses it to exactly recreate the original input data. Just like the compression algorithm, the decompressor adds a new string to the dictionary each time it reads in a new code. All it needs to do in addition is to translate each incoming code into a string and send it to the output. A sample run of the LZW decompressor is shown in below.
Input code: /WED<256>E<260><261><257>B Input code: Output string: New code value and associated string: / / W W 256 = /W E E 257 = WE D D 258 = ED 256 /W 259 = D/ E E 260 = /WE 260 /WE 261 = E/ 261 E/ 262 = /WEE 257 WE 263 = E/W B B 264 = WEBThe most remarkable feature of this type of compression is that the entire dictionary has been transmitted to the decoder without actually explicitly transmitting the dictionary. At the end of the run, the decoder will have a dictionary identical to the one the encoder has, built up entirely as part of the decoding process.
LZW is more commonly encountered today in a variant known as LZC, after its use in the UNIX "compress" program. In this variant, pointers do not have a fixed length. Rather, they start with a length of 9 bits, and then slowly grow to their maximum possible length once all the pointers of a particular size have been used up. Furthermore, the dictionary is not frozen once it is full as for LZW - the program continually monitors compression performance, and once this starts decreasing the entire dictionary is discarded and rebuilt from scratch. More recent schemes use some sort of least-recently-used algorithm to discard little-used phrases once the dictionary becomes full rather than throwing away the entire dictionary.
Finally, not all schemes build up the dictionary by adding a single new character to the end of the current phrase. An alternative technique is to concatenate the previous two phrases (LZMW), which results in a faster buildup of longer phrases than the character-by-character buildup of the other methods. The disadvantage of this method is that a more sophisticated data structure is needed to handle the dictionary.
[A good introduction to LZW, MW, AP and Y coding is given in the yabba package. For ftp information, see question 2 in part one, file type .Y]
while( lookAheadBuffer not empty ) { get a pointer ( position, match ) to the longest match in the window for the lookahead buffer; if( length > MINIMUM_MATCH_LENGTH ) { output a ( position, length ) pair; shift the window length characters along; } else { output the first character in the lookahead buffer; shift the window 1 character along; } }Decompression is simple and fast: Whenever a ( position, length ) pair is encountered, go to that ( position ) in the window and copy ( length ) bytes to the output.
Sliding-window-based schemes can be simplified by numbering the input text characters mod N, in effect creating a circular buffer. The sliding window approach automatically creates the LRU effect which must be done explicitly in LZ78 schemes. Variants of this method apply additional compression to the output of the LZSS compressor, which include a simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP 1.x)), all of which result in a certain degree of improvement over the basic scheme, especially when the data are rather random and the LZSS compressor has little effect.
Recently an algorithm was developed which combines the ideas behind LZ77 and LZ78 to produce a hybrid called LZFG. LZFG uses the standard sliding window, but stores the data in a modified trie data structure and produces as output the position of the text in the trie. Since LZFG only inserts complete *phrases* into the dictionary, it should run faster than other LZ77-based compressors.
All popular archivers (arj, lha, zip, zoo) are variations on the LZ77 theme.