Better (faster) algorithm to compare two vectors of vectors of integers?

Hello all,

I have one set of raw data file(s), each with 8 to 9 million lines (yes,
8,000,000~9,000,000) in the following format:

    1,2,3,4,5,16,23,35
    1,2,3,4,6,17,23,36
    1,2,3,4,7,18,23,37
    1,2,3,4,8,19,23,38
    1,2,3,4,9,20,23,39
    1,2,3,4,10,21,23,40
    1,2,3,4,11,22,23,41
    1,2,3,4,12,23,24,42
    1,2,3,4,13,24,25,43
    1,2,3,4,14,25,26,44

Each line has 8 sorted numbers in the range 1~49.
Another set of "filter" file(s), each with 6 to 7 million lines, in the
following format:

    13,4,7,8,18,20
    9,10,11,12,5,6,7,8,1,2,3,4,21,22,23,24,13,14,15,16,29,30,31,32,45,46,47,48
    29,49,36,37,34,17,15,9,16,30,28,47,46,27,20,32,14,26,1,4,3,6,10,2,7,48,44,41

Each line has 4~28 unsorted numbers in the range 1~49.
I need to compare each line from the "raw data" file with every line in the "filter" file
and get the intersection value, e.g. comparing line 1 in raw with lines 1~3 in filter:

    1  // since only 4 is in common with filter line 1
    7  // since only 35 not found in filter line 2
    5  // since 5, 23, 35 not found in filter line 3

After the comparison, the results are output according to a threshold value,
e.g.
output raw data lines with intersection value >= 2,
output raw data lines with intersection value == 4

I know that there are (at most) 9 million x 8 million line comparisons.
At first, I tried using set_intersection to do the job, but it takes forever.
(The filter line is sorted before being passed to set_intersection.)
	int res[8];
	int *it = set_intersection(Raw.Data, Raw.Data+8, FilterVal.begin(), FilterVal.end(), res);
	int IntersectCnt = it - res;

Next, I tried building up an array of integers initialized to zero:

	int ResArr[49] = {0}; // zero-initializes all 49 elements

and used 3 helper functions:

	void InitResArr(int * inResArr, vector<int> & FilterVal) {
		for (size_t i = 0; i < FilterVal.size(); i++) {
			inResArr[FilterVal[i] - 1] = 1;
		}
	}

	void ResetResArr(int * inResArr, vector<int> & FilterVal) {
		for (size_t i = 0; i < FilterVal.size(); i++) {
			inResArr[FilterVal[i] - 1] = 0;
		}
	}

	int GetIntersect(int * inResArr, int * inRawData) {
		int RtnVal = 0;
		for (int i = 0; i < 8; i++) {
			RtnVal += inResArr[inRawData[i] - 1];
		}
		return RtnVal;
	}

But this approach still takes over 3 hours to finish 1 comparison (1 raw data file against 1 filter file).
And I have 5,000 raw data files and 40,000 filter files to go!!!
Is there any better approach to handle this task? Thanks.

Regds

LAM Chi-fung
Can you compare line to line with something like memcmp, and only when lines are not equal handle them at a finer grain (which value is different, etc.)? If you can check big blocks as equal and skip the equal ones (or emit a default message for equality) it might help.

What do you KNOW? Do you know whether the data is mostly equal or mostly not equal? Anything else about the data that might give you a way to bypass some of the work?

Have you multi-threaded it? These volumes may be a great place to split the work up and process it in parallel.
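Something like this, for instance -- a rough sketch only, and it assumes the 64-bit-mask representation that comes up in the replies below (intersect_count, process_range, and run_parallel are illustrative names, not anything from the OP's code):

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Assumed representation: each line packed into a 64-bit mask
// (bit v is set when the line contains value v).
static int intersect_count(uint64_t a, uint64_t b) {
    int cnt = 0;
    for (uint64_t x = a & b; x; x &= x - 1) ++cnt;  // count bits set in the AND
    return cnt;
}

// Each thread handles raw lines [begin, end) against every filter line and
// records the best intersection value seen for each raw line.
static void process_range(const std::vector<uint64_t>& raw,
                          const std::vector<uint64_t>& filters,
                          std::size_t begin, std::size_t end,
                          std::vector<int>& best) {
    for (std::size_t i = begin; i < end; ++i)
        for (uint64_t f : filters)
            best[i] = std::max(best[i], intersect_count(raw[i], f));
}

void run_parallel(const std::vector<uint64_t>& raw,
                  const std::vector<uint64_t>& filters,
                  std::vector<int>& best) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                            // fallback when the count is unknown
    std::size_t chunk = (raw.size() + n - 1) / n; // split raw lines into n slices
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t b = t * chunk;
        std::size_t e = std::min(raw.size(), b + chunk);
        if (b >= e) break;
        pool.emplace_back(process_range, std::cref(raw), std::cref(filters),
                          b, e, std::ref(best));
    }
    for (auto& th : pool) th.join();  // disjoint slices, so no locking needed
}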

// Create 64-bit representation of the values:
uint64_t n = 0;
while (more_values_on_line())
    n |= uint64_t(1) << get_next_value();

// Perform intersection with xor:
uint64_t res = n ^ x;

// Count bits that are still set:
int numbits(uint64_t n) {
     unsigned cnt = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
     return ((cnt >> 3 + cnt) & 030707070707) % 63;
}


Also, if you are reading the data more than once you should be able to read it all in and process it entirely in memory (at least any files that you read more than once).
What an interesting problem. Analyzing lottery results?

I was thinking the same thing as tpb. A few comments though:
// Perform intersection with xor:
That should be AND instead of XOR (& instead of ^)

// Count bits that are still set

Below are two very old C functions that count bits in a 32-bit word using clever bit twiddling. They can be extended for a 64-bit value. Most of the time, I'd expect the number of set bits to be small, so probably the second one is best:
/* Routines for counting the number of bits in a 32-bit word */

/*
 * This routine counts bits by dividing the word into 1-bit fields and adding
 * Pairs of adjacent fields together.  This gives 2-bit fields, each containing a
 * count of the number of bits that were originally in those positions.
 * Now you can add adjacent fields together to get 4 bit fields.  Then 8, 16, and
 * finally 32.
 */
unsigned bitcount1(value)
register unsigned value;
{
        register unsigned tmp;

        tmp = value >> 1;
        value &= 0x55555555;
        tmp &= 0x55555555;
        value += tmp;

        /*
         * The word is now broken into fields of two bits each.
         * Each field contains the number of bits that were set
         */

        tmp = value >> 2;
        value &= 0x33333333;
        tmp &= 0x33333333;
        value += tmp;                           /* now each field is 4 bits wide */

        tmp = value >> 4;
        value &= 0x0f0f0f0f;
        tmp &= 0x0f0f0f0f;
        value += tmp;                           /* now 8 */
   
        tmp = value >> 8;
        value &= 0x00ff00ff;
        tmp &= 0x00ff00ff;
        value += tmp;                           /* now 16 */

        tmp = value >> 16;
        value &= 0x0000ffff;
        tmp &= 0x0000ffff;
        return value + tmp;                     /* finally one field 32 bits wide */
}


/*
 * This version is good when you expect few (if any) bits to be set.  It takes
 * advantage of a two's complement trick.  Consider:
 *              x & (x-1)
 * Where x has at least one bit set.  This expression is x with the LSB turned off.
 * This algorithm simply increments a counter while repeatedly turning off the LSB.
 */
unsigned
bitcount2(value)
register unsigned value;
{
        register unsigned count;

        for (count=0; value; value &= value - 1) {
                ++count;
        }
        return count;
}

@dhayden, I wasn't thinking when I said xor; of course it's and! :-)
And that bit counting routine I posted doesn't seem to work, even if you extend the constants to 64-bits.
So I'm using dhayden's bitcount2 (updated and extended to 64-bits).

#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <vector>
#include <cstdint>
using namespace std;

vector<uint64_t> read_file(const char *filename) {
    vector<uint64_t> v;
    ifstream fin(filename);
    string line;
    while (getline(fin, line)) {
        istringstream sin(line);
        int value = 0;
        char comma;
        uint64_t n = 0;
        while (sin >> value) {
            n |= uint64_t(1) << value;
            sin >> comma;
        }
        v.push_back(n);
    }
    return v;
}

int numbits(uint64_t value) {
    int cnt = 0;
    for ( ; value; value &= value - 1) ++cnt;
    return cnt;
}

void prn(uint64_t u) { // printing from lowest-order bit to highest
    for (uint64_t mask = 1; mask; mask <<= 1) cout << !!(u & mask);
}

void prn(vector<uint64_t>& v) {
    for (int i=0;i<6;i++) for (int j=0;j<10;j++) cout<<i; cout<<"6666\n";
    for (int i=0;i<6;i++) for (int j=0;j<10;j++) cout<<j; cout<<"0123\n";
    for (uint64_t u: v) { prn(u); cout << " : " << setw(2) << numbits(u) << '\n'; }
}

int main() {
    vector<uint64_t> v = read_file("raw01.txt");
    v.push_back(v[v.size()-2] & v.back());  // intersection of last two lines
    prn(v);    
}



Input file:
1,2,3,4,5,6,7,8
21,22,23,24,25,26,27,28
2,4,6,8,10,12,14,16
5,10,15,20,25,30,35,40
3, 7, 16, 18, 21, 23, 25, 38
23,43,25,17,9,3,22,14,22,41,31,39,16

Output:
0000000000111111111122222222223333333333444444444455555555556666
0123456789012345678901234567890123456789012345678901234567890123
0111111110000000000000000000000000000000000000000000000000000000 :  8
0000000000000000000001111111100000000000000000000000000000000000 :  8
0010101010101010100000000000000000000000000000000000000000000000 :  8
0000010000100001000010000100001000010000100000000000000000000000 :  8
0001000100000000101001010100000000000010000000000000000000000000 :  8
0001000001000010110000110100000100000001010100000000000000000000 : 12
0001000000000000100000010100000000000000000000000000000000000000 :  4


EDIT: Here's a working alternate bitcnt:
uint64_t numbits(uint64_t n) {
     n = n - ((n >> 1) & 0x5555555555555555);
     n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333);
     return (((n + (n >> 4)) & 0x0F0F0F0F0F0F0F0F) * 0x0101010101010101) >> 56;
}

¿do you need the results line by line? ¿or do you care about a global value? (like `n' lines are above the threshold)

you have O(n^2) without considering the intersection part. that seems too much for your limits
ne555 wrote:
that seems too much for your limits

I think you're right. Taking somewhat smaller values:
Comparing 5 million objects against 5 million other objects in every pairing is 5 million squared, or 25 trillion, comparisons. If each comparison takes a millionth of a second the run would take nine-and-a-half months. If each comparison takes a billionth of a second it will run for about 7 hours.
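(Spelled out: 5e6 * 5e6 = 2.5e13 comparisons. At 1e-6 s per comparison that's 2.5e7 seconds, roughly 290 days; at 1e-9 s per comparison it's 2.5e4 seconds, roughly 7 hours.)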
@jonnin,@tpb,@dhayden,@ne555

Thanks for your help; I am now modifying my program.
@dhayden, this is not lottery analysis; it's part of a big project.
Our client is trying to build a "fortune telling" database from 2 Chinese fortune-telling methods:
"date of birth and eight characters of a horoscope" and "Purple Star Astrology".
Usually "date of birth and eight characters of a horoscope" transforms one's birthday
using "the decimal cycle" and "the duodecimal cycle", and so has repeated values.
But our client uses a special method to encode the birthday into 8 distinct integers (the raw data, 1~49)
and to encode the "Purple Star Astrology" into 4~28 integers (the filter file). I am not responsible
for the encoding; my task is to compare and match (find the differences between) the 2 sets of data.
The client is very ambitious and wants to combine these 2 methods ~_~
Anyway, this is their issue and I only need to generate the comparison results.

But you've inspired me; maybe I can make use of this for lottery analysis ^_^ Any ideas?
If each comparison takes a millionth of a second the run would take nine-and-a-half months. If each comparison takes a billionth of a second it will run for about 7 hours.

Keep in mind that at 4GHz, a single core executes up to 4 trillion instructions per second [edit: where's my brain? That's 4 billion, not trillion. Thanks tpb]. Computers are mind-numbingly fast. If it takes a billionth of a second to do a comparison, then the code is running slow.
And I have 5,000 raw data files and 40,000 filters to go!!!

Do you need to compare each line in each raw file against all 40,000 filter files? Or is it more like "each raw file has 8 filter files that it must compare against"? Please be as specific as possible. I have an idea for another algorithm that might be faster if the numbers are big enough.

If there are 5,000 raw files with 8-9 million entries then there may be duplicates. There are only about 451 million ways to select 8 numbers from 1-49.
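If duplicates are common, collapsing them first is cheap once each line is packed into a 64-bit mask (as in tpb's read_file above). A minimal sketch, assuming that representation:

#include <algorithm>
#include <cstdint>
#include <vector>

// Sort the packed lines and drop exact duplicates; returns how many were removed.
// If every original line (not just each distinct one) must appear in the output,
// keep a count or the original indices per unique mask instead of discarding them.
std::size_t dedup(std::vector<std::uint64_t>& masks) {
    std::sort(masks.begin(), masks.end());
    auto last = std::unique(masks.begin(), masks.end());
    std::size_t removed = masks.end() - last;
    masks.erase(last, masks.end());
    return removed;
}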

dhayden wrote:
Keep in mind that at 4GHz, a single core executes up to 4 trillion instructions per second.

4GHz means 4 billion clock ticks per second ... and you're saying it can execute 1000 times that many instructions in a second?

4 billion instructions per second, at least 4 cores, maybe 8. Each core can potentially do 2 simple things at once via its 2 pipes, so an uber-tweaked program might manage 4*8*2 = 64 billion? A CUDA-enabled machine can do so many more than this it is mind-blowing, but a straight CPU-only program is doing pretty well by itself. Numbers like that are going to mean no jumps, no cache misses/page faults, just pure assembly iteration of simple single-cycle instructions.

The only way it's going to get into the trillions is with CUDA or a Beowulf-type setup or a server with a bunch more processors, etc. Even a good desktop, not so much? A single core is not even in the running.
@dhayden

And I have 5,000 raw data files and 40,000 filters to go!!!

Do you need to compare each line in each raw file against all 40,000 filter files? Or is it more like "each raw file has 8 filter files that it must compare against"? Please be as specific as possible. I have an idea for another algorithm that might be faster if the numbers are big enough.

Yes, I need to compare each line in each raw file with each line in the corresponding set of filter files (yes, 1 raw file needs to be compared against 8 filter files).
Let me get this straight.

You have 5000 raw data files of about 8 million lines each.

Each one of these raw data files needs to be compared against 8 filter files, where the filters have about 6 million lines each.

Is that right?

You do realize that the problem being discussed right now is whether your computational task is even possible. Do you have access to any of the high-speed computing machinery (and the knowledge to program it) as mentioned by jonnin in his last post?

I vastly underestimated the task in my last attempt. It's more like:

8 million lines * 6 million lines * 8 filters * 5000 data files

8e6 * 6e6 * 8 * 5e3 = 1.9e18 = almost 2 billion billion comparisons.

If you could do a trillion comparisons per second it would take about 23 days.
Our client try to build up a "fortune telling" database from 2 Chinese fortune telling methods:
"date of birth and eight characters of a horoscope" and "Purple Star Astrology"
usually "date of birth and eight characters of a horoscope" ...
... encode the "Purple Star Astrology"

C++ has just the library for you:
http://www.cplusplus.com/reference/random/


But if you still want to go ahead ...

MPI + 128 cores: maybe 5 hours if @tpb's analysis is correct. You'll need access to a suitable cluster, but most universities will have one.
https://www.mpich.org/
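The overall shape would be something like this minimal sketch (do_compare and the line count here are placeholders, not real code from this thread; each rank takes every nprocs-th raw line):

#include <mpi.h>

// Placeholder: compare raw line i against every filter line.
void do_compare(long /*i*/) { /* ... */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long total_lines = 8000000;   // stand-in for the real raw-file line count
    for (long i = rank; i < total_lines; i += nprocs)
        do_compare(i);                  // strided split: no two ranks share a line

    MPI_Finalize();
    return 0;
}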
All that means is that you can't realistically do this with brute force. If I can find some time, I'll try to work up a better algorithm.
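One cheap win inside the brute force, though: since (per the OP's spec) a raw line only needs to be output once when it meets the threshold, the scan over the filter lines can stop at the first qualifying hit. A sketch using the 64-bit-mask scheme from earlier in the thread (names are illustrative):

#include <cstdint>
#include <vector>

// Intersection size = number of bits set in the AND of the two masks.
inline int intersect_count(uint64_t a, uint64_t b) {
    int cnt = 0;
    for (uint64_t x = a & b; x; x &= x - 1) ++cnt;
    return cnt;
}

// Return the raw lines whose intersection with ANY filter line meets the
// threshold; break out of the filter scan as soon as a line qualifies.
std::vector<uint64_t> accepted(const std::vector<uint64_t>& raw,
                               const std::vector<uint64_t>& filters,
                               int threshold) {
    std::vector<uint64_t> out;
    for (uint64_t r : raw) {
        for (uint64_t f : filters) {
            if (intersect_count(r, f) >= threshold) {
                out.push_back(r);   // output each accepted raw line once
                break;              // early exit: remaining filters don't matter
            }
        }
    }
    return out;
}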
@dhayden, @lastchance, @tpb

I only have access to 5~7 quad-core i7 desktops... (no supercomputer). It seems there is no simple method to tackle this task >_<
CFLam, I"m still, not clear on exactly what the output should be. Let's take that first line in your example:
1,2,3,4,5,16,23,35

Suppose your threshold value is ==7. If this line has exactly 7 numbers in common with 100 lines in the filter file then what do you output? One copy of the raw line? 100 copies of the raw line? 100 lines, each with the raw line and the filter line that it matches?

I'm asking because subtle differences like this can be exploited to create a faster algorithm.

Also, you gave some info on where the data comes from. Is the data random or is there some mathematical formula that generates it? If there's a math formula then it may be possible to compute the answer mathematically without doing comparisons at all.
@dhayden

If the raw line has an intersection value >= the threshold with any filter line (e.g. with any of those 100 lines), it will be marked accepted and output. Only 1 copy of the raw line will be output.

The data (raw file) come from transforming birthdays in a selected range, and I don't know the formula (the client keeps it secret) ~_~
Can you get it, though? Even if you don't KNOW what it is doing, if you had a black-box library / API-type call you could still exploit it (it's better to understand it, but just being able to compute it can be useful).

@jonnin

Sigh~ Not in my case; I can't get that ~_~