My LZW article so far...

http://cplusplus.com/articles/iL18T05o/

I need feedback, constructive criticism, and the like.

I'm particularly interested in ways to simplify and shorten the text while preserving the meaning, and in overall correctness. If I messed up some definitions/concepts, let me know.

Code reviews most welcome.

I'm currently trying to optimize the encoder for speed.
It looks like a real professional article written by a real person. I wish I understood any of it and knew what it was about.

Could I write articles? I wonder what I could contribute (nothing so far; maybe what to listen to while coding).
First pass, mostly touching punctuation and grammar:

The LZ family of algorithms
Abraham Lempel and Jacob Ziv published two compression algorithms: LZ77 in 1977 and LZ78 in 1978.

Terry Welch published the LZW algorithm in 1984 as an improvement of LZ78.

LZW is just one of the original LZ algorithms' many derivatives, the more famous ones being LZSS (used in RAR), LZMA (used in 7z), and Deflate (used in ZIP).


Goal
The goal is to implement an LZW file archiver and gradually improve it.

Currently available versions and their brief description:
Here are the successively improved versions of the archiver that accompany this article:
The list of versions that follows should not have a period at the end of each item.

Personally, I would also put a link to the section of the article that discusses each version, and to the attached sources. Yes, I know that duplicates data in the TOC and at the end, but it is friendlier to the reader.

Language
I|We will use C++11, the 2011 revision of the C++ standard.

In order to compile the code snippets and the attached source code, your C++11 compiler must properly support lambda functions, range-based for() loops, and initializer lists.
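A quick compile test might help readers check their toolchain before diving in. This is just my sketch (not from the attached sources); it touches all three features:

  // Minimal C++11 feature test: if this builds and runs, the compiler
  // supports everything the article's snippets rely on.
  #include <iostream>
  #include <vector>

  int main()
  {
      std::vector<int> v {1, 2, 3};              // initializer list

      auto square = [](int n) { return n * n; }; // lambda function

      for (int n : v)                            // range-based for() loop
          std::cout << square(n) << ' ';
      std::cout << '\n';
  }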

Of course, it is entirely possible to modify the code to use the older C++ standards if you really need to.


Lossless Compression
The whole intent behind compression is to make your data fit in less space than it otherwise would. We also wish to be able to recover our original data using only the compressed data. LZW is a lossless compression scheme, meaning that the recovered data is identical to the original data. We can view these points as:

  definitions
  • D_SRC is our original data
  • D_LZW is our compressed data
  • D_REC is our recovered data

  compressible
  • D_LZW = compress(D_SRC)
  • sizeof(D_LZW) < sizeof(D_SRC)

  recoverable and lossless
  • D_REC = recover(D_LZW)
  • D_REC == D_SRC

Notes:

  • The compression ratio is calculated as: sizeof(D_LZW) / sizeof(D_SRC)
  • Encode is another term for compress
  • Decode and decompress are other terms for recover
  • You will also see 'uncompress' as a synonym for 'decompress'. Just remember that it is not quite correct to use it as a verb. Uncompressed data is data that has not been compressed.
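If you want to make the round-trip property concrete, you could show a tiny demonstration right after the definitions. The codec below is a trivial stand-in (run-length encoding, not LZW, and entirely my sketch); the point is only the lossless round trip:

  // A trivial stand-in codec (run-length encoding, NOT LZW), used only
  // to demonstrate the lossless round-trip property: D_REC == D_SRC.
  #include <cassert>
  #include <iostream>
  #include <string>

  // compress(): emit each run as a (count, byte) pair; counts cap at 255.
  std::string compress(const std::string &src)
  {
      std::string out;
      for (std::size_t i = 0; i < src.size(); )
      {
          unsigned char count = 1;
          while (i + count < src.size() && src[i + count] == src[i] && count < 255)
              ++count;
          out += static_cast<char>(count);
          out += src[i];
          i += count;
      }
      return out;
  }

  // recover(): expand each (count, byte) pair back into a run.
  std::string recover(const std::string &lzw)
  {
      std::string out;
      for (std::size_t i = 0; i + 1 < lzw.size(); i += 2)
          out.append(static_cast<unsigned char>(lzw[i]), lzw[i + 1]);
      return out;
  }

  int main()
  {
      const std::string d_src = "aaaabbbcc";
      const std::string d_lzw = compress(d_src); // D_LZW = compress(D_SRC)
      const std::string d_rec = recover(d_lzw);  // D_REC = recover(D_LZW)

      assert(d_rec == d_src);                    // lossless
      std::cout << d_src.size() << " -> " << d_lzw.size() << " bytes\n";
  }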

I always try to avoid anything that looks too strictly mathematical, even if it impacts brevity. (The main target audience of this site is not physicists, and those who are can easily handle the dumbed-down stuff.)

Entropy and dictionary coders
For our purposes, entropy is a measure of the unpredictability of data. Data with low entropy tends to have repeating sequences. Data with high entropy tends to be random.
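You could make "measure" concrete with a small helper. This Shannon-entropy estimate is my sketch, not part of the archiver; it reports bits per byte, where low values mean repetitive data and values near 8 mean random-looking data:

  // Estimate Shannon entropy in bits per byte from byte frequencies.
  // 0.0 means fully predictable; 8.0 means uniformly random bytes.
  #include <array>
  #include <cmath>
  #include <cstddef>
  #include <string>

  double entropy_bits_per_byte(const std::string &data)
  {
      if (data.empty()) return 0.0;

      std::array<std::size_t, 256> counts {};   // zero-initialized
      for (unsigned char b : data) ++counts[b];

      double h = 0.0;
      for (std::size_t c : counts)
          if (c != 0)
          {
              double p = static_cast<double>(c) / data.size();
              h -= p * std::log2(p);
          }
      return h;
  }

(Roughly speaking, plain English text lands somewhere around 4 to 5 bits per byte by this estimate, while already-compressed data sits close to 8.)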


Alas, I'm out of time for the moment. I'll look at it again later.
Continuing with entropy. I had particular trouble with the readability of the dictionary coder stuff.

I also think that this is the real "meat" of the article - explaining in basic terms what will be refined by the pseudocode and source code.

The LZW algorithm is a "dictionary coder." It works by finding long strings that repeat in the original data and replacing each long string with a much shorter token (or "code"). The list of strings and their associated codes is stored in a lookup dictionary.

For example, suppose you have the text:

  "Johnny Appleseed planted many apples."

It would be fair to notice that the string "pples" occurs twice. Let's add that to our dictionary and update our data:

  1 = pples
  "Johnny A1eed planted many a1."


That is much shorter. Already we have achieved compression. If you are paying attention, you might notice we can do better. The strings "pl", "es", "an", "ed", and "n" are also repeated. (So is "e", but the only place there is an "ee" is already handled by an "ed".) Let's build a better dictionary and use it on the original data.

  1 = pl
  2 = es
  3 = an
  4 = ed
  5 = nn
  6 = p12
  "Joh5y A6e4 16t4 m3y a6."


That looks pretty good. It saved us 14 characters, a 38% savings. But notice that we did something sneaky with the dictionary in that last example. Entry 6 makes reference to other entries. This is very important in the construction of an LZW dictionary.
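Since entry 6 refers to other entries, expanding a code is naturally recursive. Here is the toy scheme in code (illustration only, and my own sketch; real LZW works differently, as the later sections show):

  // Expand the toy example's compressed text. Codes are the digits
  // '1'..'6'; an entry such as 6 = "p12" may reference other entries,
  // so expansion recurses.
  #include <iostream>
  #include <map>
  #include <string>

  std::string expand(const std::string &text,
                     const std::map<char, std::string> &dict)
  {
      std::string result;
      for (char c : text)
      {
          auto entry = dict.find(c);
          if (entry != dict.end())
              result += expand(entry->second, dict); // entry may nest
          else
              result += c;                           // literal character
      }
      return result;
  }

  int main()
  {
      const std::map<char, std::string> dict {
          {'1', "pl"}, {'2', "es"}, {'3', "an"},
          {'4', "ed"}, {'5', "nn"}, {'6', "p12"}
      };

      std::cout << expand("Joh5y A6e4 16t4 m3y a6.", dict) << '\n';
      // prints: Johnny Appleseed planted many apples.
  }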

The other important thing is how you preserve the dictionary. In our pathetic example, keeping track of the dictionary in addition to our compressed data produces a bigger collection of data than we started with. That's a problem.

The true genius of the LZW algorithm is that it throws away the dictionary when it is done compressing, and reconstructs it while decompressing the data!

The way it does this is pretty slick, and is the subject of the remaining sections in this article.
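As a preview, the classic encoder loop fits in about twenty lines. This is my own sketch rather than the article's implementation; it assumes 8-bit input and unbounded integer codes, where a real encoder would pack codes into bits and cap the dictionary size. Notice that the dictionary is built as a side effect of scanning the data, which is why it never has to be stored:

  // Bare-bones LZW encoder sketch: grow the current match as long as it
  // stays in the dictionary; on a miss, emit the match's code and add
  // the extended string as a new entry.
  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  std::vector<int> lzw_encode(const std::string &src)
  {
      std::map<std::string, int> dict;
      for (int c = 0; c < 256; ++c)             // seed with all single bytes
          dict[std::string(1, static_cast<char>(c))] = c;

      std::vector<int> out;
      std::string s;                            // current longest match

      for (char c : src)
      {
          if (dict.count(s + c))
              s += c;                           // still matches; grow it
          else
          {
              out.push_back(dict[s]);           // emit code for the match
              const int new_code = static_cast<int>(dict.size());
              dict[s + c] = new_code;           // remember the new string
              s = std::string(1, c);
          }
      }
      if (!s.empty())
          out.push_back(dict[s]);               // flush the final match
      return out;
  }

  int main()
  {
      for (int code : lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
          std::cout << code << ' ';
      std::cout << '\n';
  }

The matching decoder rebuilds the same entries in the same order, which is exactly the trick the remaining sections explain.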

Notes:

  • The longer the string being replaced, the more the compression.
  • The more often a string is replaced, the more the compression.
  • LZW finds repeated strings automatically; its greedy longest-match strategy works well in practice, though it is not truly optimal.


Well, that's all you'll get out of me for now. I hope you find it useful.
Thank you, Duoas.
Wonderful article.