Algorithm shopping: Economic object transfer (lossless compression)

This isn't really about C++ but I imagine the fine fellows here are probably my best bet!

Proprietary system (some weird SoC) - no ready tools for this
Proprietary programming language (looks like old Python but not really) - no useful libraries for my case

The system needs to send serialized objects over the net. It has good compute power but very little bandwidth.

I was thinking of compressing all objects locally before dispatching, but I have a requirement: "We need to try and get at least some objects, if we can, in case the connection drops"

I don't know of any archive format where chunks can be decompressed even if not all of them are present (not without redundancy, which is frowned upon here).

So, I am thinking: I wish I could "open" the archive... I'd send the dictionary first and then each compressed object. The receiver will be able to decompress object-by-object and thus achieve some survivability.

This sounds like I have to write my own compression, in order to have control over what I do with the dictionary and the objects.
Of course, I am not aiming to beat 7zip... If I get the output under 90% of the original size, I think this will satisfy my requirement for this iteration.
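
To make the idea concrete (sketched in C++ here, since the proprietary language has no public name; append_frame and byte_t are just illustrative, nothing from the platform): each compressed object would go out as a length-prefixed frame, so the receiver keeps every object that arrived in full even if the connection drops mid-transfer.

#include <cstdint>
#include <vector>

using byte_t = unsigned char;

// Hypothetical framing sketch: [4-byte big-endian length][compressed payload]
// per object. A receiver reads 4 bytes, then exactly that many payload bytes,
// so any object that arrived completely is usable on its own.
void append_frame(std::vector<byte_t> &out, const std::vector<byte_t> &payload){
    const uint32_t len = static_cast<uint32_t>(payload.size());
    out.push_back(byte_t(len >> 24));
    out.push_back(byte_t(len >> 16));
    out.push_back(byte_t(len >> 8));
    out.push_back(byte_t(len));
    out.insert(out.end(), payload.begin(), payload.end());
}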

My biggest concern right now is I haven't studied any compression algorithms.
These objects are binary - no standard format.
They are 10KB-10MB in size.
Identical parts between objects can be 5%-30%.

I know compression algorithms like to start by finding the "largest repeating pattern" and then find smaller and smaller patterns.
I can't think of a way of doing this that isn't basically brute force, which would probably take years to compress a few large files...

Any ideas or mentions of known algo's will be greatly appreciated!
You shouldn't use an archive format. Archives are meant to store and transmit directory structures. All you need to transmit is an array of bytes (the dictionary) and then a linear sequence of arrays of bytes. A raw streaming compression library (where you give chunks of your uncompressed bitstream to the library and it gives you back chunks of a single compressed bitstream) is enough for this. I think any random single-threaded compression library will let you recover as much data as possible if the stream cuts off. LZ4, zlib, bzip2, and liblzma all work like this, if I'm not mistaken.

Conceptually, your code will look like this:
51
void send_buffer(Compressor &c, Socket &s, const std::vector<byte_t> &buffer){
    auto n = buffer.size();
    for (size_t offset = 0; offset < n;){
        if (c.can_take_more()){
            auto bytes_read = c.input(&buffer[offset], n - offset);
            offset += bytes_read; //advance by the number of bytes the compressor actually consumed
            continue;
        }
        while (c.has_output())
            s.send(c.output());
    }
    while (c.has_output())
        s.send(c.output());
}

void sender(){
    std::vector<byte_t> dictionary;
    std::vector<std::vector<byte_t>> objects;
    
    Compressor c;
    Socket s;
    
    //I assume dictionary contains some way for the receiver to know its length.
    send_buffer(c, s, dictionary);
    for (auto &o : objects)
        send_buffer(c, s, o);
}

void receiver(){
    Socket s;
    Decompressor d;
    std::vector<byte_t> buffer;
    while (s.good()){
        if (d.can_take_more()){
            d.input(s.receive_some());
            continue;
        }
        while (d.has_output()){
            auto output = d.output();
            buffer.insert(buffer.end(), output.begin(), output.end()); //append output to buffer
        }
    }
    while (d.has_output()){
        auto output = d.output();
        buffer.insert(buffer.end(), output.begin(), output.end()); //append output to buffer
    }
    if (s.disconnected() && d.incomplete()){
        //TODO: Stream broken. Handle it.
    }
    //TODO: Decode buffer and split it into dictionary and objects.
}
Thanks :)

However, I don't have these libraries.
I guess I can look into what LZ4 and zlib do and try to copy some quick wins from those...
I imagine they have man-decades of work invested in them, though, and won't be an easy port...

Is there not some reasonably profitable algorithm I can learn and implement in the janky language that is available to me?

Just to show my work: I looked into Run-Length - but that seems more like per-object compression, when the same character is repeated (pictures, etc).
I also looked into Huffman Coding - but that just seems like a way to transform individual characters to more efficient representations.

I feel that I need something that helps to find patterns, which I can build a dictionary from.
Like:
File1: 10001000
File2: 1110001001
Dictionary: # = 1000
Compressed data:
File1: ##
File2: 11#1001
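
Roughly, in C++ terms, the substitution step would look something like this (just a sketch of the toy example above; a real version would need to escape the token byte when it occurs in the data, and, more importantly, a way to find good patterns):

#include <string>

// Naive illustration of the "# = 1000" idea: replace every occurrence of a
// shared dictionary pattern with a one-byte token. Escaping and pattern
// discovery are deliberately left out.
std::string substitute(const std::string &data,
                       const std::string &pattern, char token){
    std::string out;
    for (size_t i = 0; i < data.size();){
        if (data.compare(i, pattern.size(), pattern) == 0){
            out += token;
            i += pattern.size();
        } else {
            out += data[i++];
        }
    }
    return out;
}

// substitute("10001000",   "1000", '#') == "##"
// substitute("1110001001", "1000", '#') == "11#1001"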

The only methods I can think of for finding such patterns are too embarrassing to even mention here :D
Is there not some reasonably profitable algorithm I can learn and implement in the janky language that is available to me?
Perhaps it's worth considering whether you're falling into premature optimization, given that you have so much work to do and the possibility of introducing subtle bugs into the system. Do you know how long a transfer will take under typical usage, and whether that is too long by whatever criterion is used to determine this?

I looked into Run-Length - but that seems more like per-object compression, when the same character is repeated
RLE operates on bitstreams. Since you're dealing with serialized objects, there's nothing preventing you from concatenating them into a single bitstream.
That said, RLE only performs well when the input meets fairly specific statistical properties. If you don't know whether your data will have these properties, it's rarely worthwhile to apply RLE.
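
For reference, the encoding side of byte-oriented RLE is only a few lines. A rough sketch (assuming runs are capped at 255 so the count fits in one byte):

#include <vector>

using byte_t = unsigned char;

// Minimal run-length encoding sketch: emit (count, value) pairs, with runs
// capped at 255. Only pays off when the input really contains long runs of
// identical bytes; on typical binary data it can easily make things bigger.
std::vector<byte_t> rle_encode(const std::vector<byte_t> &in){
    std::vector<byte_t> out;
    for (size_t i = 0; i < in.size();){
        size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        out.push_back(static_cast<byte_t>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

The decoder just reads the (count, value) pairs back and expands them.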

I feel that I need something that helps to find patterns, which I can build a dictionary from.
The Lempel-Ziv (LZ*) family of algorithms basically works like this. If you want algorithms in this family that aren't too difficult to implement, you could look into LZ77 or LZSS. According to Wikipedia, DEFLATE (used by zlib and ZIP) is LZSS plus Huffman. LZ4 is also supposed to be fairly simple.
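
To give a feel for the amount of code involved, here is a deliberately naive LZ77-style sketch in C++ (brute-force match search, fixed-width tokens, no Huffman stage, names made up for illustration; nowhere near zlib's ratio or speed). Seeding the search window with shared data before each object is essentially the pre-shared dictionary trick you're describing.

#include <cstdint>
#include <string>
#include <vector>

// Each token is [offset][length][one literal byte]; offset == 0 means
// "no match, literal only". The decompressor copies `length` bytes from
// `offset` bytes back in its own output, then appends the literal.
struct Token { uint16_t offset; uint8_t length; char literal; };

std::vector<Token> lz77_compress(const std::string &in,
                                 size_t window = 65535, size_t max_len = 255){
    std::vector<Token> out;
    size_t pos = 0;
    while (pos < in.size()){
        size_t best_len = 0, best_off = 0;
        size_t start = pos > window ? pos - window : 0;
        for (size_t cand = start; cand < pos; ++cand){  //brute-force match search
            size_t len = 0;
            while (len < max_len && pos + len + 1 < in.size() &&
                   in[cand + len] == in[pos + len])
                ++len;
            if (len > best_len){ best_len = len; best_off = pos - cand; }
        }
        //literal byte that follows the match (or a bare literal if no match)
        out.push_back({ static_cast<uint16_t>(best_len ? best_off : 0),
                        static_cast<uint8_t>(best_len), in[pos + best_len] });
        pos += best_len + 1;
    }
    return out;
}

std::string lz77_decompress(const std::vector<Token> &tokens){
    std::string out;
    for (const Token &t : tokens){
        size_t from = out.size() - t.offset;
        for (size_t i = 0; i < t.length; ++i)  //byte-by-byte so overlapping copies work
            out += out[from + i];
        out += t.literal;
    }
    return out;
}

Real implementations replace the brute-force search with hash chains and entropy-code the tokens, which is where most of those man-decades went; the core idea stays this small.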
I had a look... Oh, those were the simple ones? :D

OK, time to suck up and do some coding.

Thanks a lot for the assist, this is above and beyond what I expected.
Start by writing the code so that it calls a compression function at the source and a decompression function at the destination. In the first iteration the compression algorithm does nothing. You can play with the rest of the code to see what should be compressed, how many objects you can recover, etc.

Once you have all that working, you can think about reimplementing the compression functions to actually compress/decompress the data.
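
Something along these lines (names invented for illustration): the stand-ins keep the interface the real compressor will have, but just copy the bytes through, so the framing, transfer and partial-recovery logic can be exercised first.

#include <vector>

using byte_t = unsigned char;

// First-iteration placeholders: identical interface, no actual compression.
// Swap the bodies out later without touching the rest of the pipeline.
std::vector<byte_t> compress(const std::vector<byte_t> &in)  { return in; }
std::vector<byte_t> decompress(const std::vector<byte_t> &in){ return in; }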

Also, consider the serialization encoding. Do you need to worry about byte order? Alignment constraints? Data formats?
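
For example, writing multi-byte fields in one explicit byte order (hypothetical helpers, not from any particular library) keeps the wire format independent of whatever the SoC's CPU happens to do:

#include <cstdint>
#include <vector>

using byte_t = unsigned char;

// Serialize/deserialize a 32-bit value in an explicit (here little-endian)
// byte order, independent of host endianness and struct padding.
void put_u32_le(std::vector<byte_t> &out, uint32_t v){
    out.push_back(byte_t(v));
    out.push_back(byte_t(v >> 8));
    out.push_back(byte_t(v >> 16));
    out.push_back(byte_t(v >> 24));
}

uint32_t get_u32_le(const byte_t *p){
    return  uint32_t(p[0])        | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}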
Test bzip2 on it, as well as 7zip, regular zip, and maybe a couple of others.
bzip2 is very, very slow, but the results for some types of data (XML and other structured text especially) are amazing.

The OP will have to port the algorithm to the platform themselves; that's why they're reluctant to experiment much.