The LZ family of algorithms

Abraham Lempel and Jacob Ziv published two compression algorithms: LZ77 in 1977 and LZ78 in 1978. Terry Welch published the LZW algorithm in 1984 as an improvement of LZ78. LZW is just one of the original LZ algorithms' many derivatives, the more famous ones being LZSS (used in RAR), LZMA (used in 7z), and Deflate (used in ZIP).
Goal

The goal is to implement an LZW file archiver and gradually improve it. Here are the successively improved versions of the archiver that accompany this article:
Language

We will use C++11, the 2011 revision of the C++ language standard. In order to compile the code snippets and the attached source code, your C++11 compiler must properly support lambda functions, range-based for() loops, and initializer lists. Of course, it is entirely possible to modify the code to use an older C++ standard if you really need to.
Lossless Compression

The whole intent behind compression is to make your data fit in less space than it otherwise would. We also wish to be able to recover our original data using only the compressed data. LZW is a lossless compression scheme, meaning that the recovered data is identical to the original data. We can state these points as:

Definitions
• DSRC is our original data
• DLZW is our compressed data
• DREC is our recovered data

Compressible
• DLZW = compress(DSRC)
• sizeof(DLZW) < sizeof(DSRC)

Recoverable and lossless
• DREC = recover(DLZW)
• DREC == DSRC

Notes:
• The compression ratio is calculated as: sizeof(DLZW) / sizeof(DSRC)
• Encode is another term for compress
• Decode and decompress are other terms for recover
• You will also see 'uncompress' as a synonym for 'decompress'. Just remember that it is not quite correct to use it as a verb: uncompressed data is data that has never been compressed.
Entropy and dictionary coders

For our purposes, entropy is a measurement of the unpredictability of data. Data with low entropy tends to have repeating sequences. Data with high entropy tends to be random.
The LZW algorithm is a "dictionary coder." It works by finding long strings that repeat in the original data and substituting each long string with a much shorter token (or "code"). The list of strings and their associated codes is stored in a lookup dictionary.

For example, suppose you have the text:

"Johnny Appleseed planted many apples."

It would be fair to notice that the string "pples" occurs twice. Let's add that to our dictionary and update our data:

1 = pples

"Johnny A1eed planted many a1."

That is much shorter. Already we have achieved compression. If you are paying attention, you might notice we can do better. The strings "pl", "es", "an", "ed", and "nn" are also repeated. (So is "e", but the only place there is an "ee" is already handled by an "ed".) Let's build a better dictionary and use it on the original data.

1 = pl
2 = es
3 = an
4 = ed
5 = nn
6 = p12

"Joh5y A6e4 13t4 m3y a6."

That looks pretty good. It saved us 14 characters, a 38% savings. But notice that we did something sneaky with the dictionary in that last example: entry 6 makes reference to other entries (p12 expands to p + pl + es = "pples"). This is very important in the construction of an LZW dictionary.

The other important thing is how you preserve the dictionary. In our pathetic example, keeping track of the dictionary in addition to our compressed data produces a bigger collection of data than we started with. That's a problem. The true genius of the LZW algorithm is that it throws away the dictionary when it is done compressing, and reconstructs it while decompressing the data! The way it does this is pretty slick, and it is the subject of the remaining sections in this article.

Notes:
• The longer the string being replaced, the greater the compression.
• The more often a string is replaced, the greater the compression.
• LZW finds good substitutions automatically as it scans the data, with no separate analysis pass.