Better(faster) algorithm to compare 2 vector of vector of integer ?

Pages: 12
Each line has 8 sorted numbers and range from 1~49.
Another set of "filter" file(s) each has 6 millions ~ 7 millions line in the
following format,

13,4,7,8,18,20
9,10,11,12,5,6,7,8,1,2,3,4,21,22,23,24,13,14,15,16,29,30,31,32,45,46,47,48
29,49,36,37,34,17,15,9,16,30,28,47,46,27,20,32,14,26,1,4,3,6,10,2,7,48,44,41

ok, what if you build up a giant cache (and probably write it to disk when you're done to not lose it ;D)

vector<array<bool,50>> filters;

read each line and perform a frequency scan on the integers found (or use bits as suggested, but this might be easier to write initially)

just ignore index 0 of the array

For first filter line, we'd set
1
2
3
4
5
6
filters[0][13] = true;
filters[0][4] = true;
filters[0][7] = true;
filters[0][8] = true;
filters[0][18] = true;
filters[0][20] = true;


etc. as big as the filter file. Can keep this in memory and/or serialize it; we're done with the file itself.

Now the raw input file
for each line (~8mil)
    for each integer (8)
        for each filter (~7million)
            tally up score against the filter index; record (line, filter_line) pairs somewhere  , (if score is >=2 etc.)


You could launch like 8 workers to do 1 million lines each of the input file (outer loop) for example. I suspect something like this can be done in <24hrs since it's 2018 ;D
You could take a similar approach to generate the cache (each thread has a mutually exclusive section to write to).

Last edited on
@icy1
Good idea, thanks, will find time to implement it, see if this help my case ^_^
Topic archived. No new replies allowed.
Pages: 12