Compression Algorithm for numbers

Forum

Forum
General C++ Programming
Compression Algorithm for numbers

Compression Algorithm for numbers

paul198204 (3)

Hi,

I am looking for a compression algorithm which compress sequence of random numbers (will be in sorted order but some of the numbers may be missing). The compressed data will be sent to other component where decompression will take place.

I am new to compression techniques and any help would be appreciated.

Thanks,
Paul

m4ster r0shi (2201)

Since, apparently [1], you've been struggling with this for almost three
months without any progress, here's something to get you started:

A simple method you can use is run-length encoding [2].
Suppose you have the following number sequence:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

An alternative representation to the above is to write it as the first
element followed by the differences between consecutive elements:

1, 1, 1, 1, 1, 1, 1, 1, 1, 1

You can now use run-length encoding to compress the above into:

1, 10

There are several improvements you can make here.
Consider the following sequence:

1, 2, 3, 4, 5, 7, 10, 12, 13, 15

The sequence of differences is:

1, 1, 1, 1, 1, 2, 3, 2, 1, 2

And the run-length encoding is:

1, 5, 2, 1, 3, 1, 2, 1, 1, 1, 2, 1

This is a problem. The compressed version of your number list is actually bigger than the original one. Notice that the problem is the second half of the original list. Wouldn't it be great if you could somehow have the first half compressed and leave the second part uncompressed? I have good news for you: You can do that! Since -1 can't be an element of the sequence you have to compress, you can use it to signal interpretation changes in the compressed sequence:

1, 5, -1, 2, 3, 2, 1, 2

Actually, you can use all negative numbers to signal similar changes in the compressed sequence. For example, you could use -13 to indicate that, in the following 4 entries, the interpretation changes, remains the same, changes, and then changes again (since 13 is 1101 in binary). This could be useful in encoding, for example:

1, 2, 3, 4, 6, 9, 10, 11, 12, 13, 16, 20, 25

1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 3, 4, 5

1, 4, -13, 2, 3, 1, 4, 3, 4, 5

Using only -1 you would have to write it as:

1, 4, -1, 2, 3, -1, 1, 4, -1, 3, 4, 5

And without negative numbers you'd have to write it as:

1, 4, 2, 1, 3, 1, 1, 4, 3, 1, 4, 1, 5, 1

Also, notice that since negative powers of 2 are actually
shifted -1s, you could assign a different meaning to them.

Here's some python code that implements the simplest version of the compression method I described above:

from itertools import chain, groupby
from random import random

maxNumber = 20

numbers = [n for n in range(1, maxNumber + 1) if random() < 0.75]

def seq2dif(l):
    return [l[0]] + [l[i + 1] - l[i] for i in range(len(l) - 1)]

def dif2seq(l):
    c = 0
    s = []

    for i in range(len(l)):
        c += l[i]
        s.append(c)

    return s

def flatten(l):
    return list(chain.from_iterable(l))

def encode(l):
    return flatten([[k, len(list(g))] for k, g in groupby(l)])

def decode(l):
    return flatten([[l[i * 2]] * l[i * 2 + 1] for i in range(int(len(l) / 2))])

print(numbers)

compressed = encode(seq2dif(numbers))
print(compressed)

print(dif2seq(decode(compressed)))

http://ideone.com/WqvrQw

Try to rewrite it in a language you're familiar with (perhaps C++). Then, try to extend it using negative numbers. And when you're done playing, go check the links Catfish666 posted in your other thread.

[1] http://www.cplusplus.com/forum/general/119456/
[2] http://en.wikipedia.org/wiki/Run-length_encoding

Last edited on

JLBorges (13770)

> I am looking for a compression algorithm which compress sequence of random numbers
> (will be in sorted order but some of the numbers may be missing).

1. Is there an upper bound on the largest number?

2. Are these random numbers unique? (there are no repeated numbers)

3. Would the count of missing numbers be smaller than the count of numbers that are present in the sorted sequence?

paul198204 (3)

Hi,

Thank you m4ster r0shi & JLBorges for your replies.

m4ster r0shi,

I understood the delta encoding you are referring to and will try it. But before that I would like to know whether "LZW" algorithm which is dictionary based helps in this scenario?

Since I am working on a product which requires fast algorithm, I am looking for the better option before I proceed further.

JLBorges,

1. Is there an upper bound on the largest number?

As of now it will be 8192. But in future it may increase little bit

2. Are these random numbers unique? (there are no repeated numbers)

Yes

3. Would the count of missing numbers be smaller than the count of numbers that are present in the sorted sequence?

Not really

Thanks,
Praveen V

JLBorges (13770)

We could represent the subset of N (8192) numbers by a set of N (8192) bits; with a one bit indicating that the number is present in the subset.

The compress function could be something like this:

#include <iostream>
#include <bitset>
#include <vector>
#include <cstdint>
#include <algorithm>

constexpr std::size_t NBITS = 8 ;
using octet = std::bitset<NBITS> ;

std::vector<std::uint8_t> compress( const std::vector<unsigned int>& numbers )
{
    std::vector<std::uint8_t> compact ;

    if( !numbers.empty() )
    {
        const std::size_t maxv = *std::max_element( numbers.begin(), numbers.end() ) ;
        std::vector<octet> bits( maxv/NBITS + 1 ) ;
        for( unsigned int v : numbers ) bits[v/NBITS][ NBITS - v%NBITS - 1 ] = true ;

        for( const octet& i8 : bits ) compact.push_back( i8.to_ulong() ) ;
    }

    return compact ;
}

int main()
{
    std::vector<unsigned int> numbers ;
    for( unsigned int i = 0 ; i < 512 ; ++i ) if( i%3) numbers.push_back(i) ;

    auto compact = compress(numbers) ;
    
    int n = 0 ;
    for( std::uint8_t byte : compact )
    {
        std::cout << octet(byte) << ' ' ;
        if( ++n % 8 == 0 ) std::cout << '\n' ;  
    }
    std::cout << '\n' ;
}

http://coliru.stacked-crooked.com/a/722bee36d17940bc

m4ster r0shi (2201)

In that case, you can use what JLBorges suggested instead of delta encoding as a preprocessing step. It already provides a very compact representation, but it can be reduced further (even with just a Burrows–Wheeler transform [1] and run-length encoding [2]) in certain cases, e.g. when the bit stream exhibits repeated patterns, as in the example presented above.

from itertools import chain, groupby
from random import random

upperBound = 512

#numbers = [n for n in range(0, upperBound) if random() < 0.5]
numbers = [n for n in range(0, upperBound) if n % 3 != 0]

bits = ''.join([str(1 if n in numbers else 0) for n in range(0, upperBound)])

def flatten(l):
    return list(chain.from_iterable(l))

def encode(l):
    return flatten([[k, len(list(g))] for k, g in groupby(l)])

def decode(l):
    return flatten([[l[i * 2]] * l[i * 2 + 1] for i in range(int(len(l) / 2))])

def bwt(s):

    assert "^" not in s, "Input string cannot contain '^'"

    s += "^"  # Add end of file marker
    table = sorted(s[i:] + s[:i] for i in range(len(s)))  # Table of rotations of string
    last_column = [row[-1:] for row in table]  # Last characters of each row

    return "".join(last_column)  # Convert list of characters into string

def ibwt(r, *args):

        firstCol = "".join(sorted(r))
        count = [0]*256
        byteStart = [-1]*256
        output = [""] * len(r)
        shortcut = [None]*len(r)

        #Generates shortcut lists
        for i in range(len(r)):
                shortcutIndex = ord(r[i])
                shortcut[i] = count[shortcutIndex]
                count[shortcutIndex] += 1
                shortcutIndex = ord(firstCol[i])
                if byteStart[shortcutIndex] == -1:
                        byteStart[shortcutIndex] = i

        localIndex = (r.index("^") if not args else args[0])

        for i in range(len(r)):
                #takes the next index indicated by the transformation vector
                nextByte = r[localIndex]
                output [len(r)-i-1] = nextByte
                shortcutIndex = ord(nextByte)
                #assigns localIndex to the next index in the transformation vector
                localIndex = byteStart[shortcutIndex] + shortcut[localIndex]

        return "".join(output).rstrip("^")

ppbits = [bits[i : i + 8] for i in range(0, len(bits), 8)]
ppbits = [ppbits[i : i + 8] for i in range(0, len(ppbits), 8)]

print '\n'.join(map(lambda l : ' '.join(l), ppbits)), '\n'

compressed = encode(list(bwt(bits)))
print compressed, '\n'

bits_ = ibwt(''.join(decode(compressed)))
numbers_ = [n for n in range(len(bits_)) if bits_[n] == '1']

print bits == bits_ and numbers == numbers_

http://ideone.com/gY4v3w

[1] http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
[2] http://en.wikipedia.org/wiki/Run-length_encoding

Last edited on

Topic archived. No new replies allowed.