### Compression Algorithm for numbers

Hi,

I am looking for a compression algorithm which compress sequence of random numbers (will be in sorted order but some of the numbers may be missing). The compressed data will be sent to other component where decompression will take place.

I am new to compression techniques and any help would be appreciated.

Thanks,
Paul
Since, apparently [1], you've been struggling with this for almost three
months without any progress, here's something to get you started:

A simple method you can use is run-length encoding [2].
Suppose you have the following number sequence:

 `1, 2, 3, 4, 5, 6, 7, 8, 9, 10`

An alternative representation to the above is to write it as the first
element followed by the differences between consecutive elements:

 `1, 1, 1, 1, 1, 1, 1, 1, 1, 1`

You can now use run-length encoding to compress the above into:

 `1, 10`

There are several improvements you can make here.
Consider the following sequence:

 `1, 2, 3, 4, 5, 7, 10, 12, 13, 15`

The sequence of differences is:

 `1, 1, 1, 1, 1, 2, 3, 2, 1, 2`

And the run-length encoding is:

 `1, 5, 2, 1, 3, 1, 2, 1, 1, 1, 2, 1`

This is a problem. The compressed version of your number list is actually bigger than the original one. Notice that the problem is the second half of the original list. Wouldn't it be great if you could somehow have the first half compressed and leave the second part uncompressed? I have good news for you: You can do that! Since -1 can't be an element of the sequence you have to compress, you can use it to signal interpretation changes in the compressed sequence:

 `1, 5, -1, 2, 3, 2, 1, 2`

Actually, you can use all negative numbers to signal similar changes in the compressed sequence. For example, you could use -13 to indicate that, in the following 4 entries, the interpretation changes, remains the same, changes, and then changes again (since 13 is 1101 in binary). This could be useful in encoding, for example:

 ```1, 2, 3, 4, 6, 9, 10, 11, 12, 13, 16, 20, 25 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 3, 4, 5 1, 4, -13, 2, 3, 1, 4, 3, 4, 5```

Using only -1 you would have to write it as:

 `1, 4, -1, 2, 3, -1, 1, 4, -1, 3, 4, 5`

And without negative numbers you'd have to write it as:

 `1, 4, 2, 1, 3, 1, 1, 4, 3, 1, 4, 1, 5, 1`

Also, notice that since negative powers of 2 are actually
shifted -1s, you could assign a different meaning to them.

Here's some python code that implements the simplest version of the compression method I described above:

 ``1234567891011121314151617181920212223242526272829303132333435`` ``````from itertools import chain, groupby from random import random maxNumber = 20 numbers = [n for n in range(1, maxNumber + 1) if random() < 0.75] def seq2dif(l): return [l[0]] + [l[i + 1] - l[i] for i in range(len(l) - 1)] def dif2seq(l): c = 0 s = [] for i in range(len(l)): c += l[i] s.append(c) return s def flatten(l): return list(chain.from_iterable(l)) def encode(l): return flatten([[k, len(list(g))] for k, g in groupby(l)]) def decode(l): return flatten([[l[i * 2]] * l[i * 2 + 1] for i in range(int(len(l) / 2))]) print(numbers) compressed = encode(seq2dif(numbers)) print(compressed) print(dif2seq(decode(compressed)))``````

http://ideone.com/WqvrQw

Try to rewrite it in a language you're familiar with (perhaps C++). Then, try to extend it using negative numbers. And when you're done playing, go check the links Catfish666 posted in your other thread.

[1] http://www.cplusplus.com/forum/general/119456/
[2] http://en.wikipedia.org/wiki/Run-length_encoding
Last edited on
> I am looking for a compression algorithm which compress sequence of random numbers
> (will be in sorted order but some of the numbers may be missing).

1. Is there an upper bound on the largest number?

2. Are these random numbers unique? (there are no repeated numbers)

3. Would the count of missing numbers be smaller than the count of numbers that are present in the sorted sequence?
Hi,

Thank you m4ster r0shi & JLBorges for your replies.

m4ster r0shi,

I understood the delta encoding you are referring to and will try it. But before that I would like to know whether "LZW" algorithm which is dictionary based helps in this scenario?

Since I am working on a product which requires fast algorithm, I am looking for the better option before I proceed further.

JLBorges,

1. Is there an upper bound on the largest number?

As of now it will be 8192. But in future it may increase little bit

2. Are these random numbers unique? (there are no repeated numbers)

Yes

3. Would the count of missing numbers be smaller than the count of numbers that are present in the sorted sequence?

Not really

Thanks,
Praveen V
We could represent the subset of N (8192) numbers by a set of N (8192) bits; with a one bit indicating that the number is present in the subset.

The compress function could be something like this:

 ``12345678910111213141516171819202122232425262728293031323334353637383940`` ``````#include #include #include #include #include constexpr std::size_t NBITS = 8 ; using octet = std::bitset ; std::vector compress( const std::vector& numbers ) { std::vector compact ; if( !numbers.empty() ) { const std::size_t maxv = *std::max_element( numbers.begin(), numbers.end() ) ; std::vector bits( maxv/NBITS + 1 ) ; for( unsigned int v : numbers ) bits[v/NBITS][ NBITS - v%NBITS - 1 ] = true ; for( const octet& i8 : bits ) compact.push_back( i8.to_ulong() ) ; } return compact ; } int main() { std::vector numbers ; for( unsigned int i = 0 ; i < 512 ; ++i ) if( i%3) numbers.push_back(i) ; auto compact = compress(numbers) ; int n = 0 ; for( std::uint8_t byte : compact ) { std::cout << octet(byte) << ' ' ; if( ++n % 8 == 0 ) std::cout << '\n' ; } std::cout << '\n' ; }``````

http://coliru.stacked-crooked.com/a/722bee36d17940bc
In that case, you can use what JLBorges suggested instead of delta encoding as a preprocessing step. It already provides a very compact representation, but it can be reduced further (even with just a Burrowsâ€“Wheeler transform [1] and run-length encoding [2]) in certain cases, e.g. when the bit stream exhibits repeated patterns, as in the example presented above.

 ``12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970`` ``````from itertools import chain, groupby from random import random upperBound = 512 #numbers = [n for n in range(0, upperBound) if random() < 0.5] numbers = [n for n in range(0, upperBound) if n % 3 != 0] bits = ''.join([str(1 if n in numbers else 0) for n in range(0, upperBound)]) def flatten(l): return list(chain.from_iterable(l)) def encode(l): return flatten([[k, len(list(g))] for k, g in groupby(l)]) def decode(l): return flatten([[l[i * 2]] * l[i * 2 + 1] for i in range(int(len(l) / 2))]) def bwt(s): assert "^" not in s, "Input string cannot contain '^'" s += "^" # Add end of file marker table = sorted(s[i:] + s[:i] for i in range(len(s))) # Table of rotations of string last_column = [row[-1:] for row in table] # Last characters of each row return "".join(last_column) # Convert list of characters into string def ibwt(r, *args): firstCol = "".join(sorted(r)) count = [0]*256 byteStart = [-1]*256 output = [""] * len(r) shortcut = [None]*len(r) #Generates shortcut lists for i in range(len(r)): shortcutIndex = ord(r[i]) shortcut[i] = count[shortcutIndex] count[shortcutIndex] += 1 shortcutIndex = ord(firstCol[i]) if byteStart[shortcutIndex] == -1: byteStart[shortcutIndex] = i localIndex = (r.index("^") if not args else args[0]) for i in range(len(r)): #takes the next index indicated by the transformation vector nextByte = r[localIndex] output [len(r)-i-1] = nextByte shortcutIndex = ord(nextByte) #assigns localIndex to the next index in the transformation vector localIndex = byteStart[shortcutIndex] + shortcut[localIndex] return "".join(output).rstrip("^") ppbits = [bits[i : i + 8] for i in range(0, len(bits), 8)] ppbits = [ppbits[i : i + 8] for i in range(0, len(ppbits), 8)] print '\n'.join(map(lambda l : ' '.join(l), ppbits)), '\n' compressed = encode(list(bwt(bits))) print compressed, '\n' bits_ = ibwt(''.join(decode(compressed))) numbers_ = [n for n in range(len(bits_)) if bits_[n] == '1'] print bits == bits_ and numbers == numbers_``````

http://ideone.com/gY4v3w

[1] http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
[2] http://en.wikipedia.org/wiki/Run-length_encoding
Last edited on
Topic archived. No new replies allowed.