I need help with LZW v.6

If you've seen my LZW article, version 6 will be the first beta.

The goal of v.6 is to introduce variable-width codes, that is, reading and writing compression codes of arbitrary bit width: starting at 9 and growing to 10, 11, 12 and so on as the dictionary fills (once it holds more than 512 entries, 9 bits no longer suffice), up to 32 (which won't be reached because of a hard-coded dictionary limit).

The problem is, for the life of me I can't figure out how to shift and mask the bits to achieve this effect. I feel as dumb as a brick.

Now, someone suggested an elegant solution that uses std::bitset and std::string by converting one to the other, but my intuition tells me it would be a huge performance killer if I tried it.

Honestly, I'd like some straight code that I may use in my program, under the same license I used for the rest of the code, and be done with it.

Otherwise, pseudocode explaining the process is good too.

If you're interested in helping, you may consider coding the get() and put() member functions below.

using CodeType = std::uint32_t;

class CodeReader {
    // ...

    // you may add other members if needed, maybe used to store progress?
    std::istream &is;
    std::size_t bits; // current bit width, use it but don't change it
};

bool CodeReader::get(CodeType &k)
{
    // ???
    return static_cast<bool>(is); // the stream's operator bool is explicit in C++11
}

class CodeWriter {
    // ...

    std::ostream &os;
    std::size_t bits;
};

bool CodeWriter::put(CodeType k)
{
    // ???
    return static_cast<bool>(os);
}


Thank you.
You need to add a filter to the output stream that allows you to output only a specific number of bits.

class CodeWriter
{
    std::ostream  &os;        // stream where bits are eventually written
    unsigned char bit_cache;  // bits yet to be written
    unsigned      bit_count;  // number of bits in the cache

public:
    CodeWriter(std::ostream& os): os(os), bit_cache(0), bit_count(0) 
    {
    }

    ~CodeWriter()
    {
        if (bit_count != 0) put(0, 8);  // flush any remaining bits
    }

    CodeWriter& put(CodeType k, unsigned bits)
    {
        while (bits != 0)
        {
            // move as many bits from k to bit_cache as possible
            bit_cache |= (k << bit_count);
            k >>= (8 - bit_count);

            // adjust bit counts
            if ((bit_count + bits) < 8)
            {
                bit_count += bits;
                bits = 0;
            }
            else
            {
                bits -= (8 - bit_count);

                // flush the bit cache to stream
                os.put((char)bit_cache);

                bit_cache = 0;
                bit_count = 0;
            }
        }

        return *this;
    }
};

I've done this often enough that I just typed this in off the top of my head -- but I didn't test it first, so there's probably some dumb error in there.

Unpacking works the opposite way. First bits received fill from bit 0 to bit 7.
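For example, a matching get() could look like this (another untested, off-the-top-of-my-head sketch; it mirrors the writer's members -- an istream is plus bit_cache and bit_count -- and uses the same (value, width) signature):

CodeReader& get(CodeType &k, unsigned bits)
{
    k = 0;
    unsigned got = 0;                       // bits of k filled so far
    while (got < bits)
    {
        if (bit_count == 0)                 // cache empty: source another byte
        {
            int c = is.get();
            if (c < 0) break;               // EOF or error: k is incomplete
            bit_cache = (unsigned char)c;
            bit_count = 8;
        }
        unsigned n = std::min(bit_count, bits - got);
        k |= (CodeType)(bit_cache & ((1u << n) - 1)) << got;  // fill from bit 0 up
        bit_cache >>= n;
        bit_count -= n;
        got += n;
    }
    return *this;
}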

Hope this helps.

BTW, please be consistent with your { }s.
Thank you, Duoas. I'm currently trying to put together the CodeReader class.

I disagree on the minor issue of curly brace inconsistency. To me, it's like

if ()
while ()
for ()
sizeof ()
static_cast<> ()
// ... 


versus

function1()
function2()
function3()
tfunction<>()
// ... 


I use mixed rules for curly braces, but I use them consistently.
Of course, if more people bring this up as a readability issue, I will amend it.
Can anyone see any logical problem with this?

bool read(CodeType &k, std::size_t bits)
{
    k = bit_cache;
    bits -= bit_count;

    while (bits != 0)
    {
        char temp;

        is.get(temp);

        if (bits >= CHAR_BIT)
        {
            k <<= CHAR_BIT;
            k |= static_cast<unsigned char> (temp);
            bits -= CHAR_BIT;
        }
        else
        {
            k <<= bits;
            k |= static_cast<unsigned char> (temp) >> (CHAR_BIT - bits);

            const unsigned char mask = UCHAR_MAX >> bits;

            bit_cache = temp & mask;
            bit_count = CHAR_BIT - bits;
            bits = 0;
        }
    }

    return static_cast<bool>(is);
}

Now I can see your code has a Little Endian style, and mine is Big Endian. No wonder they don't work together.

Duoas, could you please give your reader too? I admit this is too difficult for me.
1. bits may be less than bit_count
2. don't source data if you don't need it (at just a quick glance -- it might not, once 1 is fixed)

Give me a bit to respond with code.


[edit] sorry, I pulled this up earlier but had to run to give my toddler a bath -- she got into the pantry and dumped flour all over everything :-J

[edit 2] sorry again. I'll try to get something up for you tonight.

There is no 'endian' when packing bits to a byte stream -- there is 'in order' and 'out of order'. Bits added later get tacked on at MSB. That is, first bit is bit 0 in first byte of stream. Last bit is bit N in byte M of stream.

The reason is that the decoder may not know how many bits to pull when reading. If you pack the stream in any other order you must also unpack it the same way or your data will be corrupted.
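To make that concrete, here is what pushing two 9-bit codes through the CodeWriter above should produce (values picked at random; untested, like the rest):

#include <iostream>
#include <sstream>

int main()
{
    std::ostringstream ss;
    {
        CodeWriter cw(ss);                  // the writer from my earlier post
        cw.put(0x1AB, 9).put(0x0C4, 9);     // 1 1010 1011, then 0 1100 0100
    }                                       // destructor flushes the partial byte

    for (unsigned char c : ss.str())
        std::cout << std::hex << (unsigned) c << ' ';
    std::cout << '\n';                      // expected output: ab 89 1
}

The first code's bit 0 lands at bit 0 of the first byte; the second code picks up at bit 1 of the second byte, exactly where the first one stopped.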

Hope this helps.
I'll try to get something up for you tonight.

Much appreciated!
I think I've reached my limit, and I'll take a break after the next version of the article is published. Your help decides if v.6 will be here sooner... or much later.

There is no 'endian' when packing bits to a byte stream -- there is 'in order' and 'out of order'. Bits added later get tacked on at MSB. That is, first bit is bit 0 in first byte of stream. Last bit is bit N in byte M of stream.

I meant "endian" figuratively. I thought of it when I noticed:
for 111011101,

your code writes: 11011101 00000001
my code reads: 11101110 10000000

Sorry to take so long. I had to type something up because I still cannot afford to recover my data from my old hard drives (both of which crashed at the same time some months ago -- holy cow, like a year ago).

I meant "endian" figuratively"

I know what you meant. :o\

I think that MSB packing makes little sense, so I stick to LSB packing.
http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch#Packing_order
(I wrote my first LZW decoder to handle GIF files.)

Well, here's what I came up with for you. Again, not tested.

std::size_t // Returns the number of bits read. Returns zero on error or EOF.
read(
    CodeType&    k,         // result
    std::size_t  bits,      // number of bits to read
    std::size_t& k_count )  // number of bits already used in k
{
    static unsigned char MASKS[] = { 0x00, 0x01, 0x03, 0x07, 0x0F, 0x1F, 0x3F, 0x7F, 0xFF };

    std::size_t initial_k_count = k_count;

    // We cannot decode more bits than we have space to store them
    // (MAX_CODETYPE_BIT_COUNT is the total bit count of CodeType, 32 here)
    bits = std::min( bits, MAX_CODETYPE_BIT_COUNT - k_count );

    while (true)
    {
        // Transfer as many bits as needed/available from bit_cache to k
        std::size_t n = std::min( bit_count, bits );

        k |= (CodeType)(bit_cache & MASKS[ n ]) << k_count;

        // Safely update our cache data
        k_count    += n;
        bits       -= n;

        bit_count  -= n;
        bit_cache >>= n;

        // Note: at this point at least one of {bits, bit_count} is zero.

        // Only source more data if we need it; quit on error
        if (!bits or !is.get((char&)bit_cache))
        {
            break;
        }
        bit_count = 8;
    }

    return k_count - initial_k_count;
}

std::size_t // Returns the number of bits read == number of bits in k. Returns zero on error or EOF.
read(
    CodeType& k,        // result
    std::size_t bits )  // number of bits to read
{
    std::size_t k_count = 0;
    k = 0;
    return read( k, bits, k_count );
}

I don't know your coding style, so I hope it isn't too difficult to adjust for you.

Hope this helps.
Thank you for your time and effort.

I went on to butcher your code, adapting it to my style and to the CodeReader class.

The compressor seems to work fine. The decompressor however, doesn't.

The program signals the decompressor didn't find the special code MetaCode::Eof, which is written by the compressor at the end of the file.

The resulting decompressed file is exactly the same size as the compressed file, and its contents consist of the byte 0x80 repeated over and over.

Here is the current code. Thanks again.
http://pastebin.com/UiZ1yPMa
Hmm... shouldn't we read to an intermediary value instead of bit_cache?

        bit_cache >>= n;

        // Note: at this point at least one of {bits, bit_count} is zero.

        // Only source more data if we need it; quit on error
        if (!bits or !is.get((char&)bit_cache))
        {
            break;
        }

I gave you very general code... I expected you to adapt it to your specific needs. You should be as specific as possible -- extra operations here (such as an un-inlined, overloaded function call) will cost you time in the decompression.

If bit_cache is an unsigned char, then there is no need to read to an intermediary.

I'm not sure that (char&) shouldn't be a (char). Again, I didn't test the code -- I could have made a stupid mistake somewhere. (That usually happens to me.)
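If the cast bothers you, the sourcing step can use an intermediary without changing anything else (same caveat: untested):

// Only source more data if we need it; quit on error
char temp;
if (!bits or !is.get(temp))
{
    break;
}
bit_cache = static_cast<unsigned char>(temp);
bit_count = 8;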

I'll take a look at your code later tonight.
I think I finally got it right.

Duoas (and anybody else reading), if you could spare some time for code review and testing, I'd be very thankful. I'm interested in ideas to improve performance, since I see my code is not too elegant.

FYI I went the MSB-First way, I think.

Thanks!

http://pastebin.com/v1F4vN4h
Actually I didn't: it explodes if it ever needs to reset the dictionary.

One can test easily by setting globals::dms to a very tight limit, such as 1024. The decompressor will choke on an invalid code.

Edit: current version of the code...
http://pastebin.com/8Va6QWhB
Alas, sorry I haven't responded sooner. Where are you now? Do you still want code review, etc?
Do you still want code review, etc?

Absolutely. I need the feedback in order to perfect the existing article and code.
I just don't plan to add new content for a while.

Latest version: http://cplusplus.com/articles/iL18T05o/

Version 6 in particular is hideous, so let me know if you find bugs.
Thanks.
Sorry again that I just don't have much time to dink with this....

[disclaimer: I've really only glanced through the code and article. I have not put the code through any stress tests.]

The code looks good, so no complaints. I'm glad you lowered the dictionary limit. It is true that the dictionary can be arbitrarily large, but there is an upper limit to where growth is actually useful. You've currently got it set to that upper limit (20 bits). Most LZW systems get along just fine at 12 bits. You might want to mention this kind of stuff in your article.

I have more of an issue with the article itself. It is very wordy, and at times rambles along with your effort to figure this stuff out. People reading the article don't need to know that stuff.

Cut out all the non-essential rambling and trying-to-figure-it-out stuff, and leave it only with the cold technical details of what you have chosen to do.

Don't mention stuff like endianness (as you are only mentioning it to say that it has nothing to do with the article -- get it?).

You also mention in your code that keeping track of the current bit size complicates your code. Well, yes, but that's because the rest of your code knows too much about it. You should have kept that stuff in a class so that it can take care of itself. (IMHO.)

For example, here's a fairly elegant implementation using C++11
http://marknelson.us/2011/11/08/lzw-revisited/
Notice how his input_foo_stream<> and output_foo_stream<> classes handle this so that the calling code remains agnostic about it. (Though there should be some care to watch for things like breached dictionary size, etc.)
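The gist, as an untested sketch (every name here is made up for illustration -- none of it is from your article or from Nelson's code): the writer bumps the code width by itself whenever the dictionary outgrows it, so the calling code never mentions bit counts at all.

class VariableCodeWriter
{
    std::ostream &os;
    std::size_t   bits;       // current code width, managed internally
    CodeType      next_code;  // next code the dictionary will hand out

public:
    explicit VariableCodeWriter(std::ostream &os)
    : os(os), bits(9), next_code(256 + 2)   // e.g. 256 literals plus two metacodes
    {
    }

    void put(CodeType k)
    {
        write_bits(k, bits);  // the low-level packing loop shown earlier

        // Grow the width exactly when the dictionary does; the decoder must
        // apply the same rule at the same point or the streams fall out of sync.
        if (++next_code > (CodeType(1) << bits))
            ++bits;
    }

private:
    void write_bits(CodeType k, std::size_t n);
};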

Don't aim users at some specific version of a MinGW compiler. Test your code in at least MSVC and a standard MinGW and a standard (non-Windows) GCC.

Sorry I'm not more help.

[edit] Oh yeah... make sure you stress test your code with evil files!
Don't aim users at some specific version of a MinGW compiler. Test your code in at least MSVC and a standard MinGW and a standard (non-Windows) GCC.

The article is not just about LZW, it's also about promoting C++11. Screw MSVC. I'll reconsider my stance when they release a proper C++11 compiler.

I have more of an issue with the article itself. It is very wordy, and at times rambles along with your effort to figure this stuff out. People reading the article don't need to know that stuff.

Cut out all the non-essential rambling and trying-to-figure-it-out stuff, and leave it only with the cold technical details of what you have chosen to do.

I would much appreciate it if you could point out some of the segments in question.

Notice how his input_foo_stream<> and output_foo_stream<> classes handle this so that the calling code remains agnostic about it.

I'll probably do this for Version 7, thanks. The program currently has file expansion problems which I want to fix. I read on Wikipedia that UNIX compress monitors the compression ratio and writes a clear code when appropriate, and I want the future version to mimic this.
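Something like this is what I have in mind (a rough sketch only; the helper names, MetaCode::Clear, and the checkpoint constant are all made up, and I haven't checked what compress really uses):

std::size_t in_bytes = 0, out_bytes = 0, checkpoint = 8192;
double best_ratio = 0.0;

// inside the compression loop, after emitting each code:
if (in_bytes >= checkpoint)
{
    const double ratio = static_cast<double>(in_bytes) / out_bytes;

    if (ratio < best_ratio)
    {
        write_code(MetaCode::Clear);  // hypothetical clear metacode, a sibling of MetaCode::Eof
        reset_dictionary();           // hypothetical helper
        best_ratio = 0.0;
    }
    else
        best_ratio = ratio;

    checkpoint = in_bytes + 8192;     // re-check every 8 KiB or so
}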

Edit: I apologize for being incoherent. Reading this post I realized I'm all over the place.