What is the most efficient tool to compress data (arrays of unsigned short). It can be time consuming for compression, but needs to be fast for decompression.
And how can I compress a buffer, but not writing it on disk ? Typically, I want to build a file, in which successive compressed buffers are stored.
There is no one "best" compression method. That depends on the data, and whether or not the compression process is allowed to throw away information (i.e., is lossy compression acceptable).
For instance, the repeating sequence 111...11...111 can be compressed by counting the number of 1s (run-length encoding).
More realistically, audio signals have strong periodic components, and this can be exploited by performing Fourier analysis and (for example) throwing away noise and frequency components that a human cannot hear.
Similar, much more complicated algorithms exist for typical video.
"General-purpose" compression algorithms exist too, including algorithms for text. These are generally lossless and are usually worse than a special-purpose algorithm tailored for the data-set. Uniformly random data does not compress well.
for a LARGE array of shorts, a bucket seems appropriate.
short can only have up to 65536 values, right?
so unsigned int bucket[65536] = {0};
and its compressed -- even if you have billions of them, this is all the storage you ever need (well, until you have more than an integer's worth of a single value).
ironically, you can THEN compress the above even more, eg read and write to disk via some flavor of lzw.
this is specific to shorts, though. Beyond bytes and shorts, this quickly becomes not usable; you can't use this for 64 bit integers for example, the buckets are too many.
this loses the ordering of the input. Do you need to keep that? If so, you need a way to retain that info as well.
Ok, so I think I need to explain what I want to do, it will be probably easier.
I am a physicist and studying exotic nuclei gamma ray spectroscopy. I have a huge amount of data (~50To) composed of events. One event, contains multiple gamma-rays energies which are in coincidence.
In my analysis, I need to study which gamma rays are in coincidence. The first step for this is to build a 2D matrix "unsigned short array[8192][8192] = {0}", where the two axis are the energies of the two coincident gamma-ray energies. For each couple of energies (E1,E2), I increment the value : array[E1][E2]++;
This way, if y want to have the energy distributions of the gamma rays in coincidence with energy E, I just have to project the 1D array array[E].
The problem arrive when I need to make 3 g-rays coincidences, because unsigned short array[8192][8192][8192] will explode the memory and output file size. The number of bit of the unsigned shorts are rarely fully occupied, but can be in some cases, and nearly empty in many case (but rarely 0). So it can be easily compressed. But still need to be quickly uncompressed. My idea was then to make 2D array compression, in such a way that when I want to be which energies are in coincidence with E1, and E2, I can move in my file to the position array[E1], decompress the block corresponding to the 2D E1[8192][8192] array, and then project the 1D distribution E2[8192].
If someone has understood something of what I try to explain (sorry if it is not clear), I would like to know how I can compress (using zlib for example, but I didn't find how to use it), a 2D array of unsigned short, in order to store in my output file, 8192 successive compressed 2D arrays
Actually, the statistic we have is huge, the number zero values will be negligible. I have already tried something like this using the THnSparse of the root toolkit which uses this philosophy. But it explode the memory, and the processing time of the projections become extremely long.
even if you condensed it to bytes instead of shorts its too big for a PC. You need a high memory server or a different approach.
So I will ask again... because I didn't fully understand what you are doing...
does some sort of bucket collapse of the data as I described get you what you needed? Even if you needed 3 buckets, its only 12 bytes * 65536 ... not even a gigabyte.
I doubt I'll be able to help, but... what is the size of the input data itself (per event)? 8192 bytes? Much smaller than what you're currently planning to do in the program (a 2D or 3D array)?
Can any destructive processing be done beforehand (e.g. rounding very small values to 0)?
Why do you need to output a full 2D or 3D array? -- Can't you just make a list of the data points that are in coincidence and output that? (I really don't have much clue how it works, so sorry if that's a dumb question).
Is it possible to analyze the input data before you turn it into a 2D or 3D array, to cull large portions of the data that you know aren't part of any coincidence? (e.g. for the third dimension of the array, it can't possibly be in coincidence if the other two dimensions of the array aren't already in coincidence?)
I haven't used zlib directly (only indirectly through PNG files), but if you are having trouble testing your compression hypothesis, try looking through a zlib example https://zlib.net/zlib_how.html
Or is the issue installing/building zlib in the first place?
what % of the data has 'minimal use of the bits' and what % uses more than 1/2 of the bits?
if it uses all the bits, you are looking at 2^48 values. If 90% of the data uses the least significant couple of bits out of 16 for all 3 values, you can process 90% of the data using virtually no memory at all and the other 10% you can handle on the side.... with the happy fact that the data can't cross (you won't get a match of a piece of data in the 10% to a piece in the 90%) so you can truly process them distinctly... if I understood you..?
It would help to know the capabilities of the machine that's going to be doing the analysis.
> unsigned short array[8192][8192][8192]
Yes, this will be 1Tb of memory.
Do you have anywhere near that much memory?
If you did something like this, you could at least find approximately where all the interesting spots are in the data.
1 2 3 4 5 6 7
// This is a mere 16Gb.
unsignedshort array[2048][2048][2048];
// Each array element now counts all the events within a 4x4x4 volume.
// Perhaps clip the result to maximum 0xffff just to stop it rolling over to something
// small and insignificant.
array[E1/4][E2/4][E3/4]++;
When you know where your interesting areas are, you can perform a second pass using more exotic data structures (like a sparse matrix) which only accumulate data on previously determined areas of interest.
I don't really see how to use what you called bucket collapse. In my case, I need to be able to get, in a very short time, the number of events X for which I have E1,E2 and E3 in coincidence (corresponding to the value array[E1][E2][E3] = X in my ideal array).
If I store the data in like you proposed (at least as I understand it), I will have bucket[X]={E1,E2,E3}. It doesn't allow me to get the X value that I need.
A reasonable final compressed output file size for me is around few Go. But the essential things is that I can extract in a fast way the X values
Answer to Ganado:
My input data (per event), is not constant, but is around 1 to 10 integer values. But I have more than 1e12 events... If I need to read all these data each time I want to make a projection (that means get the number of events in which I have E1,E2 and E3 in coincidence), it will takes ages. This operation needs to be done in few seconds maximum. This is why I need to build once, this 3D array (even if it takes 1 week), to then make my projections as many times as needed.
In my input data, each event is composed of around 5 gamma-rays in coincidence. And I need to make all the permutations of 3 in 5 to build my cube. So none of the input data are not in coincidence. (Very sorry if this is not clear... I'm doing the best I can but it's not easy to explain)
Answer to jonnin 2:
I cannot give you these numbers because they are not fixed. If I read only the 1000 first events, or 1e9 events, this will evolve. And I still not have analyzed the full amount of data. But the detectors efficiency is decreasing with the gamma-ray energy, for if E1,E2 and E3 at low energy, we will need the full range of an unsigned short, and when the energies are getting larger, this will decrease.
Thank you for the time you takes to try to understand my problem and to help me.
I want to add that I know that this is feasible. Similar "Cube" are already existing in softwares like "radware" or "gaspware". The processing time for a projection is very fast, and the compressed cube size is around 2-3 Go. But I don't know how they do that
ok, but do you know what percentages of the data are at high vs low energy (even roughly)? It looks to me like you may be able to divide the problem into segments, sounds like by energy level, and process it that way (?). It depends on the data whether this is worth trying.
the buckets are too big, that isnt useful sadly. it would work but you would need 2^48 buckets which is too big for your memory.
Salem's approach is quick but not exact in the first pass. You can do something like his approach to get candidate matches, then do a second pass to resolve if those are exact matches or not, but looking at FAR less data to check it.
if you collapse the shorts into bytes, you only need 2^24 values for the first pass... a few megabytes...
that really seems like a likely approach. I would go with his idea... trying to think whether you need to break up the second pass into chunks, though?
I need to be able to get, in a very short time, the number of events X for which I have E1,E2 and E3 in coincidence (corresponding to the value array[E1][E2][E3] = X in my ideal array).
Does this imply that array[E1][E2][E3] = array[E2][E1][E3] = array[E3][E2][E1] = ... for every permutation of E1, E2, E3?
Is there any special property about the entries array[a][a][a]? What about array[a][a][b]?
I would like to know how I can compress (using zlib for example, but I didn't find how to use it)
You would need to be quite lucky to get anything close to even 50 percent compression of your data with zlib (or any lossless compression) which would still be 100's of GBs. And it would probably be more like 30 (i.e., it would be 70 percent its original size). I don't see how that would help you very much. You need a scheme that takes advantage of the properties of your data.
If I understand you correctly, you want to extract a line of data from the 3D array. Why don't you store it by those lines? I.e., if the 3D array (as 3 2D arrays) was:
I think standard compression is not going to help you solve the problem you have described. It will just slow you down. I think you should go with the idea of weeding out values that have no duplicates at all, which can be done *very* quickly, and then working off what is left.
What if you create 8192 files where the N'th file contains the data for array[N]. Each file might contain triplets E2,E3,X. Store these in binary form and sort them on E2.
Now to get array[E1][E2][E3] the procedure is:
- open file E1.
- Read entries until you find E2,E3, or you find an entry where the first value is greater than E2.
If you need further speedup, memory-map the file and use binary search to find the E2,E3 entry.
It occurs to me that a 1000-to-2 compression ratio is probably impossible without discarding some data.
Basically what you have is a 8192*8192*8192 histogram. If you divide this cube into buckets of 32*32*32 (thus 256*256*256 buckets), you can apply an FFT or DCT to each of these buckets and then keep only the 4*4*4 cube containing the low frequencies. Then you would have sizeof(u16)*4*4*4*256*256*256 = 2 GiB.
During the initial processing you will still need to keep a 1 TiB matrix in storage, while you're building the histogram. I don't see a way around this.
The questions would be:
1. How quickly can this be processed to perform the projection? Undoing the FFT/DCT in particular would be expensive, but at least it can be parallelized relatively easily.
2. How much accuracy would remain in the data after applying the low-pass filter? I have no clue at all.
if you needed more accuracy for #2 you could use your own wavelet instead of a fixed cos or whatever. But that is a bit of an undertaking in and of itself.