Generate unique combinations from multiple arrays/vectors

Hello all,

I have 800 data files and each file contain 8 lines of integer eg
17,1,2,3,4,5,6,7,10,11,12,13,15,16,20,22,24,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,1,2,3,4,5,6,7,8,9,10,11,12,16,17,21,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
23,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,23,25,26,28,29,35,36,,,,,,,,,,,,,,,,,,,,,,,,,,
27,8,9,11,12,13,14,15,17,19,20,21,22,23,24,26,27,28,29,30,31,32,34,37,39,40,41,42,,,,,,,,,,,,,,,,,,,,,,
27,14,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,35,36,37,38,39,40,42,43,44,,,,,,,,,,,,,,,,,,,,,,
24,20,24,26,27,28,29,30,31,32,33,34,35,36,37,39,40,41,42,43,44,45,46,47,48,,,,,,,,,,,,,,,,,,,,,,,,,
16,33,34,35,36,37,38,39,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
14,35,37,38,39,40,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Each line has 50 elements, 1st element of each line is number count i.e. 17 of line 1 indicate there is 17 numbers in this line 1,2,3,4,5,6,7,10,11,12,13,15,16,20,22,24,26. Numbers in each line is unique , in ascending order and within range 1~49.
My task is to generate list of unique 8 numbers combinations from this 8 lines
i.e. A,B,C,D,E,F,G,H
A from line 1, B from line 2 ... H from line 8

24,517,914,624 (17*16*23*27*27*24*16*14) entries will be generated:
1,1,4,8,14,20,33,35
...
1,1,4,8,14,20,33,49
....
1,2,4,8,14,20,33,35
...
2,1,4,8,14,20,33,35
...

And then process the 24,517,914,624 entries list as follow:
i) remove entries with duplicate numbers e.g. 1,1,4,8,14,20,33,35 and 1,1,4,8,14,20,33,49 will be removed
ii) sort number in each entry in ascending order e.g. 2,1,4,8,14,20,33,35 will become 1,2,4,8,14,20,33,35
iii) remove duplicated entries e.g. 2,1,4,8,14,20,33,35 is same as 1,2,4,8,14,20,33,35 after sorted, therefore only 1 entry of 1,2,4,8,14,20,33,35 will be kept
After the above process, may be around 10 millions entries left (which is the result I want)

However. processing a 24,517,914,624 entries array is a nearly impossible task,
therefore I tried the following 2 approachs to tackle the problem (try remove entries with duplicate numbers and sort number for each entry.

1) Brute force approach, use 8 nested for loop to generate combinations:
for (int i = 0; i < LineArr[0][0]; i++) {
for (int j = 0; j < LineArr[1][0]; j++) {
for (int k = 0; k < LineArr[2][0]; k++) {
for (int l = 0; l < LineArr[3][0]; l++) {
for (int m = 0; m < LineArr[4][0]; m++) {
for (int n = 0; n < LineArr[5][0]; n++) {
for (int o = 0; o < LineArr[6][0]; o++) {
for (int p = 0; p < LineArr[7][0]; p++) {
MyRes[0]=LineArr[0][i]
MyRes[1]=LineArr[1][j]
MyRes[2]=LineArr[2][k]
MyRes[3]=LineArr[3][l]
MyRes[4]=LineArr[4][m]
MyRes[5]=LineArr[5][n]
MyRes[6]=LineArr[6][o]
MyRes[7]=LineArr[7][p]
// Sort number of MyRes and discard if it contains duplicate numbers
// store valid combination in a temp array/vector
}}}}}}}}
// remove duplicate entries in the temp array/vector ('unique' the temp array)

2) Stepwise approach
Instead of generate 8 numbers combination at once, generate 2 numbers combination from first 2 lines, sort number in each entry, remove entry with duplicate number and unify the list
the output will be something like this:
1,2
1,3
1,4

1,1 2,2 will be removed and 4,1 will become 1,4 and duplicated entries removed.

Then the above list will combine with line 3 to form 3 numbers combinations, also sort and remove entries with duplicated number and unify the list.
Apply the above to 4,5,6...8 lines to form 4,5,6...8 numbers combinations

Since this is part of an automation project, AutoIt is used throughout the project (those 800 files
are from another 3rd party software). I tried implement the combinations generation with AutoIt,
Technically approach 1) generate 24,517,914,624 entries, sort number in each entry right after generation and discard entry with duplicate number in it.
This approach takes forever to run since it involve billions entries to test/sort and its array size is much higher than AutoIt's array size limit (16 millions). Therefore approach 1) can be discarded,
it only suitable for (at most) 5 numbers combination (eg 1,3,7,14,23).

For approach 2), I tried 2 variations:
i) store the result in each step in temp array and use AutoIt's
_ArrayUnique function to remove duplicate entries. This also takes forever to run!!
ii) Instead of store the result in temp array, I make use of SQLite, i.e. put the combination
generated in each step into a single row table in SQLite, the table/row is created with PRIMARY KEY UNIQUE
Then I select the row back into AutoIt for further processing.
Variation ii) eventually work, it takes 1 hr 20 min to handle 1 file (and I have 800 of such files)

Now I plan to implement the combination generation in VC++ (VS 2017) and I have the follow questions:
1) Apart from "Brute force" and "Stepwise", any other approach/algorithm to generate unique combinations
from multiple arrays/vectors from performance point of view ?
2) To sort number in each entry and check repeat number in each entry, I think std::sort, std::search/std::find will do the job, however, since there will be millions entries to check, is there any other options from performance point of view ?
3) To remove duplicate entries (unify the combination list i.e. get unique combinations), I should use std::unique or still rely on SQLite ? since the size of array may as large as 30~40 millions and shrink to 10 millions after std::sort and std::unique or SELECT from SQLite (don't know which implementation is better in performance point of view)
4) Any ready made LIB can easy the task ?

Thanks a lot.

Regds
LAM Chi-fung
Last edited on
Please could you clarify:

Do you want
(1)
The smallest element in the sequence to come from row 1
The next smallest element to come from row 2
...


OR


(2)
AN element from row 1
AN element from 2
...
with these sequences being created and then sorted.


The following will do the latter on your test case (which isn't interesting, because it's easy to find the requisite 8 sequences).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <algorithm>
using namespace std;

//=====================================================================

template<typename T> void print( const vector<T> &V )
{
   for ( auto e : V ) cout << e << "  ";
   cout << '\n';
}

//=====================================================================

vector< vector<int> > readFile( ifstream &strm )
{
   vector< vector<int> > A;
   string line;
   while ( getline( strm, line ) )
   {
      stringstream ss( line );
      int counter;
      char c;
      ss >> counter;
      vector<int> row( counter );
      for ( int i = 0; i < counter; i++ ) ss >> c >> row[i];
      A.push_back( row );
   }
   return A;
}

//=====================================================================

vector< vector<int> > getUnique( const vector< vector<int> > &A, int N )
{
   vector< vector<int> > result;              // This will hold the unique sequences
   int rows = A.size();
   vector<int> current( rows );               // This will hold a sequence as it is being built
   vector<int> indices( rows );               // This is a parallel array holding the indices in current
   int search = 0;                            // The row to search in
   int startIndex = 0;                        // The index to start in that row

   while ( result.size() < N )
   {
      bool added = false;                                       // Can we add an element of this row to the sequence
      for ( int i = startIndex; i < A[search].size(); i++ )     // Check the remaining elements of this row
      {
         if ( find( current.begin(), current.begin() + search, A[search][i] ) == current.begin() + search )
         {
            indices[search] = i;
            current[search] = A[search][i];
            added = true;
            break;
         }
      }
      if ( added )                            // If we were able to add a new element ...
      {
         search++;                            // ... go to the next row
         startIndex = 0;                      //
      }
      else                                    // Otherwise ...
      {
         search--;                            // ... have to back up
         if ( search < 0 )
         {
            cout << "Unable to find requisite number of sequences\n";
            return result;
         }
         startIndex = indices[search] + 1;    // ... and start looking at the next index
      }
      if ( search == rows )                   // If we have filled the sequence ...
      {
         result.push_back( current );         // Add to collection
         search--;                            // Go to the last row
         startIndex = indices[search] + 1;    // ... and start searching at the next position
      }
   }

   return result;
}

//=====================================================================

int main()
{
   const int Nsequences = 8;

   ifstream in( "data.txt" );
   vector< vector<int> > A = readFile( in );
   cout << "Number of rows read = " << A.size() << '\n';

   // Get the requisite number of sequences
   vector< vector<int> > result = getUnique( A, Nsequences );

   // Sort each sequence and print it
   for ( auto &row : result )
   {
      sort( row.begin(), row.end() );
      print( row );
   }
}


data.txt
17,1,2,3,4,5,6,7,10,11,12,13,15,16,20,22,24,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,1,2,3,4,5,6,7,8,9,10,11,12,16,17,21,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
23,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,23,25,26,28,29,35,36,,,,,,,,,,,,,,,,,,,,,,,,,,
27,8,9,11,12,13,14,15,17,19,20,21,22,23,24,26,27,28,29,30,31,32,34,37,39,40,41,42,,,,,,,,,,,,,,,,,,,,,,
27,14,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,35,36,37,38,39,40,42,43,44,,,,,,,,,,,,,,,,,,,,,,
24,20,24,26,27,28,29,30,31,32,33,34,35,36,37,39,40,41,42,43,44,45,46,47,48,,,,,,,,,,,,,,,,,,,,,,,,,
16,33,34,35,36,37,38,39,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
14,35,37,38,39,40,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Number of rows read = 8
1  2  4  8  14  20  33  35  
1  2  4  8  14  20  33  37  
1  2  4  8  14  20  33  38  
1  2  4  8  14  20  33  39  
1  2  4  8  14  20  33  40  
1  2  4  8  14  20  33  41  
1  2  4  8  14  20  33  42  
1  2  4  8  14  20  33  43 

Thanks for your reply.
Should be case 2)
My task can be interept as below. For the below example
17,1,2,3,4,5,6,7,10,11,12,13,15,16,20,22,24,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,1,2,3,4,5,6,7,8,9,10,11,12,16,17,21,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
23,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,23,25,26,28,29,35,36,,,,,,,,,,,,,,,,,,,,,,,,,,
27,8,9,11,12,13,14,15,17,19,20,21,22,23,24,26,27,28,29,30,31,32,34,37,39,40,41,42,,,,,,,,,,,,,,,,,,,,,,
27,14,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,35,36,37,38,39,40,42,43,44,,,,,,,,,,,,,,,,,,,,,,
24,20,24,26,27,28,29,30,31,32,33,34,35,36,37,39,40,41,42,43,44,45,46,47,48,,,,,,,,,,,,,,,,,,,,,,,,,
16,33,34,35,36,37,38,39,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
14,35,37,38,39,40,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

24,517,914,624 (17*16*23*27*27*24*16*14) entries will be generated:
1,1,4,8,14,20,33,35
...
1,1,4,8,14,20,33,49
....
1,2,4,8,14,20,33,35
...
2,1,4,8,14,20,33,35
...

And then process the 24,517,914,624 entries list as follow:
i) remove entries with duplicate numbers e.g. 1,1,4,8,14,20,33,35 and 1,1,4,8,14,20,33,49
ii) sort number in each entry in ascending order e.g. 2,1,4,8,14,20,33,35 will become 1,2,4,8,14,20,33,35
iii) remove duplicated entries e.g. 2,1,4,8,14,20,33,35 is same as 1,2,4,8,14,20,33,35 after sorted, therefore only 1 1,2,4,8,14,20,33,35 will be kept

After the above process, may be around 10 millions entries left (which is the result I want)

However. processing a 24,517,914,624 entries array is a nearly impossible task, therefore I tried to remove entries with duplicate numbers, sort each entry or even take the "stepwise approach" to reduce the array size during generating it.
Last edited on
Sorry, but I still don't follow what you are asking.

Are
1,4,8,14,20,33,35,38
and
14,20,33,35,38,40,43,45
allowed as separate returns?

They have some numbers "in common", but they aren't the same sequence.

Actually, if all your numbers are 0-49 then you won't get 8 completely-non-intersecting sequences of 8 numbers.



It is not sensible to generate all possible sequences. Just keep constructing them until you find the requisite number of unique ones. There are considerably more than 8 in your example.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <algorithm>
using namespace std;

//=====================================================================

template<typename T> void print( const vector<T> &V )
{
   for ( auto e : V ) cout << e << "  ";
   cout << '\n';
}

//=====================================================================

vector< vector<int> > readFile( ifstream &strm )
{
   vector< vector<int> > A;
   string line;
   while ( getline( strm, line ) )
   {
      stringstream ss( line );
      int counter;
      char c;
      ss >> counter;
      vector<int> row( counter );
      for ( int i = 0; i < counter; i++ ) ss >> c >> row[i];
      A.push_back( row );
   }
   return A;
}

//=====================================================================

vector< vector<int> > getUnique( const vector< vector<int> > &A, int N )
{
   vector< vector<int> > result;              // This will hold the unique sequences
   int rows = A.size();
   vector<int> current( rows );               // This will hold a sequence as it is being built
   vector<int> indices( rows );               // This is a parallel array holding the indices in current
   int search = 0;                            // The row to search in
   int startIndex = 0;                        // The index to start in that row

   while ( result.size() < N )
   {
      bool added = false;                                       // Can we add an element of this row to the sequence
      for ( int i = startIndex; i < A[search].size(); i++ )     // Check the remaining elements of this row
      {
         if ( find( current.begin(), current.begin() + search, A[search][i] ) == current.begin() + search )
         {
            indices[search] = i;
            current[search] = A[search][i];
            added = true;
            break;
         }
      }
      if ( added )                                              // If we were able to add a new element ...
      {
         search++;                                              // ... go to the next row
         startIndex = 0;                                        //
      }
      else                                                      // Otherwise ...
      {
         search--;                                              // ... have to back up
         if ( search < 0 )
         {
            cout << "Unable to find requisite number of sequences\n";
            return result;
         }
         startIndex = indices[search] + 1;                      // ... and start looking at the next index
      }
      if ( search == rows )                                     // If we have filled the sequence ...
      {
         sort( current.begin(), current.end() );                // Sort the result before comparing with previous
         if ( find( result.begin(), result.end(), current ) == result.end() )
            result.push_back( current );                        // If not already there, then add to collection
         search--;                                              // Go to the last row
         startIndex = indices[search] + 1;                      // ... and start searching at the next position
      }
   }

   return result;
}

//=====================================================================

int main()
{
   const int Nsequences = 8;

   ifstream in( "data.txt" );
   vector< vector<int> > A = readFile( in );
   cout << "Number of rows read = " << A.size() << '\n';

   // Get the requisite number of sequences
   vector< vector<int> > result = getUnique( A, Nsequences );
   for ( auto &row : result ) print( row );
}

Last edited on
Hello,

> Are
> 1,4,8,14,20,33,35,38
> and
> 14,20,33,35,38,40,43,45
> allowed as separate returns?

Yes, they are valid entries.

> It is not sensible to generate all possible sequences. Just keep constructing them
> until you find the requisite number of unique ones. There are considerably more
> than 8 in your example.
Yes, for this example, may be around 10 millions 8 numbers combination left (this is the result I want).
Last edited on
lastchance's program should be close without the Nsequences limit.

std::set I can't tell how efficient it is, but sets are ordered and unique and the check whether set contains value is logarithmic. In other words, sorting is automatic byproduct of insertion.

However, the sequences have at most 8 elements. "Logarithmic" means less.

within range 1~49

One byte is enough to store one value? char A sequence could be a word with 8 (nonprintable) characters. std::string rather than std::vector<int>.
1
2
3
4
5
void print( const string& W )
{
   for ( auto c : W ) cout << static_cast<int>(c) << '  ';
   cout << '\n';
}



You do expect to get millions of unique sequences. Hence
std::set<std::string> getUnique( const std::vector<std::string> & );
Looks like I misinterpreted the question. I thought the OP just wanted 8 sequences, not all of them!
Well, I managed to get up to and including row 6 with a sort of stepwise approach ... then I ran out of memory! This example seems to be heading for a lot more than 10 million sequences. (Or maybe my coding is wrong).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

//=====================================================================

template<typename T> void print( const vector<T> &V )
{
   for ( auto e : V ) cout << e << "  ";
   cout << '\n';
}

//=====================================================================

string encode( const vector<int> &V )
{
   int N = V.size();
   string text( N, ' ' );
   for ( int i = 0; i < N; i++ ) text[i] = (char)( V[i] );
   return text;
}

//=====================================================================


vector<int> decode( const string &text )
{
   int N = text.size();
   vector<int> V( N );
   for ( int i = 0; i < N; i++ ) V[i] = (int)( text[i] );
   return V;
}

//=====================================================================

vector< set<string> > readFile( ifstream &strm )           // Deliberate: all strings are just one char
{
   vector< set<string> > result;
   string line;
   int counter, num;
   char c;

   while ( getline( strm, line ) )
   {
      stringstream ss( line );
      ss >> counter;
      set<string> row;
      for ( int i = 0; i < counter; i++ ) 
      {
         ss >> c >> num;
         row.insert( string( 1, (char)num ) );           // Yes, peculiar ... but watch this space
      }
      result.push_back( row );
   }
   return result;
}

//=====================================================================

set<string> combine( const set<string> &A, const set<string> &B )
{
   set<string> result;

   for ( string sa : A )
   {
      for ( string sb : B )
      {
         if ( sb.find_first_of( sa ) == string::npos ) 
         {
            string temp = sa + sb;
            sort( temp.begin(), temp.end() );
            result.insert( temp );
         }
      }
   }
   cout << "Reported size: " << result.size() << '\n';

   return result;
}

//=====================================================================

int main()
{
   ifstream in( "data.txt" );
   vector< set<string> > A = readFile( in );
// sort( A.begin(), A.end(), []( auto a, auto b ){ return a.size() > b.size(); } );      // Nope, didn't help
   cout << "Number of rows read = " << A.size() << '\n';


   // Divide and conquer (nope, this didn't work - not enough memory)
// cout << "Level 1 ...\n";
// set<string> A01 = combine( A[0], A[1] );
// set<string> A23 = combine( A[2], A[3] );
// set<string> A45 = combine( A[4], A[5] );
// set<string> A67 = combine( A[6], A[7] );
//
// cout << "Level 2 ...\n";
// set<string> A0123 = combine( A01, A23 );
// set<string> A4567 = combine( A45, A67 );
//
// cout << "Level 3 ...\n";
// set<string> result = combine( A0123, A4567 );


   // Stepwise
   set<string> result = A[0];
   int row;
// for ( row = 2; row <= A.size(); row++ )                 // Sadly, can't get this far due to memory limits
   for ( row = 2; row <= 6; row++ )
   {
      cout << "\nCurrently including row " << row << " ...\n";
      set<string> work = result;
      result.clear();
      result = combine( A[row-1], work );                  // Deliberate row - 1
   }
   row--;


   // You will need to call decode() to print them
   int N = result.size();
   cout << "\nUp to and including row " << row << "(counting from 1), there are " << N << " possible sequences\n";

   // Print the first 10
   int num = min( N, 10 );
   cout << "The first " << num << " are:\n";
   for ( string s : result )
   {
      print( decode( s ) );                                // Have to call decode to get the integers
      if ( --num == 0 ) break;
   }
}


Number of rows read = 8

Currently including row 2 ...
Reported size: 194

Currently including row 3 ...
Reported size: 3016

Currently including row 4 ...
Reported size: 53293

Currently including row 5 ...
Reported size: 721922

Currently including row 6 ...
Reported size: 8490614

Up to and including row 6(counting from 1), there are 8490614 possible sequences
The first 10 are:
1  2  4  8  14  20  
1  2  4  8  14  24  
1  2  4  8  14  26  
1  2  4  8  14  27  
1  2  4  8  14  28  
1  2  4  8  14  29  
1  2  4  8  14  30  
1  2  4  8  14  31  
1  2  4  8  14  32  
1  2  4  8  14  33  


Hello lastchance, it works, thats what I want.
Modify a little bit:
result = combine(A[row - 1], work);
-> work.clear();
}
to free up some memory. For the memory issue, will consider add more ram and compile in X64 mode. For up to 6 rows, (when compile in release version), it takes only ~10 seconds to complete!
Last edited on
Hello keskiverto, thanks for inspire me on std::set.
Apart from free up memory, add more ram and compile in X64 mode, I find that there is a sure work method: make use of SQLite, i.e. discard the std::set after 6 row and store into a SQLite table~
I managed to get to 7 rows using 64-bit (unsigned long long) storage of sequences. I'll revisit the divide-and-conquer approach, rather than the stepwise one.

However, it does look like there are considerably more than 10 million legitimate sequences.
> However, it does look like there are considerably more than 10 million legitimate
> sequences.
Yes.... depend on how many repeat number ~_~
How to get std::set to use 64 bit unsigned long long ???
Am I right that the "worst case" is that each row has all the 49 values?

You have 49 choices for the first.
You have 48 choices for the second, for one had been taken.
You have 47 choices for the third, for two had been taken.

That makes 42*43*..*49 sequences. However, since each row can supply any value, each set of 8 values has 8! enumerations, don't they?

That would leave 42*43*..*49/8! unique sequences.
6.  13 983 816
7.  85 900 584
8. 450 978 066

Almost 451 million.

8 bytes per sequence that takes 3 607 824 528 bytes for raw data, or 3.4 GB.
However, the set/vector/string have memory overhead. Therefore, the use of disk storage (e.g. SQLite) might be unavoidable.
> Am I right that the "worst case" is that each row has all the 49 values?
Yes, and this reduce to nCr (49 C 8)

> 8 bytes per sequence that takes 3 607 824 528 bytes for raw data, or 3.4 GB.
> However, the set/vector/string have memory overhead. Therefore, the use of disk > storage (e.g. SQLite) might be unavoidable.
Right, thats why I think SQLite may be necessary for this case (may be divide the problem into 2 parts, for row 1~6, work with std::set and store result row 7~8 in SQLite)
This was as far as I could get @CFLam. I jammed the integer sequences into the 64 bits (8 bytes) of an unsigned long long.

I managed to get through 7 rows - but that had already given me over 64 million sequences. Since I sorted the input arrays to deal with the largest first (seeking to rid as many duplicates as possible early on) there is only the smallest to deal with. But I don't have enough memory, I think.

I don't know, but I suspect each entry into std::set will probably require at least another 8 bytes for a couple of pointers, so here that's quite a significant overhead.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

using INT = unsigned long long;

//=====================================================================

template<typename T> void print( const vector<T> &V )
{
   for ( auto e : V ) cout << e << "  ";
   cout << '\n';
}

//=====================================================================

INT encode( const vector<int> &V )
{
   int N = V.size();
   INT bits = 0ULL;
   for ( int i = 0; i < N; i++ ) bits |= ( 1ULL << V[i] );           // Set the V[i] th bit
   return bits;
}

//=====================================================================

vector<int> decode( INT bits )
{
   vector<int> V;
   for ( int i = 1; i < 49; i++ )
   {
      if ( ( bits & ( 1ULL << i ) ) != 0 ) V.push_back( i );         // If the ith bit is set, add to vector
   }
   return V;
}

//=====================================================================

vector< set<INT> > readFile( ifstream &strm )
{
   vector< set<INT> > result;
   string line;
   int counter, num;
   char c;

   while ( getline( strm, line ) )
   {
      stringstream ss( line );
      ss >> counter;
      set<INT> row;
      for ( int i = 0; i < counter; i++ ) 
      {
         ss >> c >> num;
         row.insert( 1ULL << num );
      }
      result.push_back( row );
   }
   return result;
}

//=====================================================================

set<INT> combine( set<INT> &A, set<INT> &B )
{
   set<INT> result;

   for ( INT a : A )
   {
      for ( INT b : B )
      {
         if ( ( a & b ) == 0 ) result.insert( a | b );
      }
      A.erase( a );                    // Desperate attempt to free some memory ( A should be the larger array)
   }
   cout << "Reported size: " << result.size() << '\n';

   return result;
}

//=====================================================================

int main()
{
   ifstream in( "data.txt" );
   vector< set<INT> > A = readFile( in );
   sort( A.begin(), A.end(), []( auto a, auto b ){ return a.size() > b.size(); } );      // tries to remove most duplicates first
   cout << "Number of rows read = " << A.size() << '\n';

   set<INT> result;
   result.insert( 0ULL );
   int row;
// for ( row = 1; row <= A.size(); row++ )                 // Can't quite get here
   for ( row = 1; row <= 7       ; row++ )
   {
      cout << "\nCurrently including row " << row << " ...\n";
      set<INT> work = result;
      result.clear();
      result = combine( work, A[row-1] );                  // Deliberate row - 1
   }
   row--;


   // You will need to call decode() to print them
   int N = result.size();
   cout << "\nUp to and including row " << row << " (counting from 1), there are " << N << " possible sequences\n";

   // Print the first 10
   int num = min( N, 10 );
   cout << num << " of them are:\n";
   for ( INT bits : result )
   {
      print( decode( bits ) );                             // Have to call decode to get the integers
      if ( --num == 0 ) break;
   }
}


Number of rows read = 8

Currently including row 1 ...
Reported size: 27

Currently including row 2 ...
Reported size: 558

Currently including row 3 ...
Reported size: 8215

Currently including row 4 ...
Reported size: 117888

Currently including row 5 ...
Reported size: 1254015

Currently including row 6 ...
Reported size: 8490614

Currently including row 7 ...
Reported size: 64099944

Up to and including row 7 (counting from 1), there are 64099944 possible sequences
10 of them are:
1  2  4  8  14  20  33  
1  3  4  8  14  20  33  
2  3  4  8  14  20  33  
1  2  5  8  14  20  33  
1  3  5  8  14  20  33  
2  3  5  8  14  20  33  
1  4  5  8  14  20  33  
2  4  5  8  14  20  33  
3  4  5  8  14  20  33  
1  2  6  8  14  20  33 
Well, @CFLam, finally got there ... but only by hammering the hard disk. Couldn't fit the sequences file in memory.

It took about 20 minutes on my work PC (although that was actually running several work jobs at the same time).

I hope I'm not double-counting somewhere. It came in at just over 316 million unique sequences and gave a results file of 2.5GB (ouch!). You still have to decode the sequences because I stored them in 64-bit ints.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

using INT = unsigned long long;

//=====================================================================

template<typename T> void print( const vector<T> &V )
{
   for ( auto e : V ) cout << e << "  ";
   cout << endl;
}

//=====================================================================

INT encode( const vector<int> &V )
{
   int N = V.size();
   INT bits = 0ULL;
   for ( int i = 0; i < N; i++ ) bits |= ( 1ULL << V[i] );           // Set the V[i] th bit
   return bits;
}

//=====================================================================

vector<int> decode( INT bits )
{
   vector<int> V;
   for ( int i = 1; i < 49; i++ )
   {
      if ( ( bits & ( 1ULL << i ) ) != 0 ) V.push_back( i );         // If the ith bit is set, add to vector
   }
   return V;
}

//=====================================================================

vector< vector<INT> > readFile( ifstream &strm )
{
   vector< vector<INT> > result;
   string line;
   int counter, num;
   char c;

   while ( getline( strm, line ) )
   {
      stringstream ss( line );
      ss >> counter;
      vector<INT> row;
      for ( int i = 0; i < counter; i++ ) 
      {
         ss >> c >> num;
         row.push_back( 1ULL << num );                               // number is stored as the bit position, not as value
      }
      result.push_back( row );
   }
   return result;
}

//=====================================================================

void add( const vector<INT> &A, int &size )
{
   // Read current collection of sequences from disk
   INT *B = new INT[size];
   ifstream in( "results.dat", ios::binary );
   in.read( (char *)B, size * sizeof( INT ) );
   in.close();

   // Decide how many passes to use (avoids putting everything in a set in one go)
   int maxPass = 1 + A.size() * size / 10000000;
   cout << "Number of passes = " << maxPass << endl;


   ofstream out( "results.dat", ios::binary );
   int total = 0;
   set<INT> S;

   for ( int pass = 0; pass < maxPass; pass++ )                      // only going to deal with a fraction of sequences each pass
   {
      for ( int i = 0; i < size; i++ )
      {
         INT b = B[i];
         for ( INT a : A )
         {
            if ( ( a & b ) == 0 )                                    // no duplicates
            {
               INT num = ( a | b );                                  // new sequence
               if ( num % maxPass == pass ) S.insert( num );         // only a subset included each pass
            }
         }
      }
      int part = S.size();
      INT *C = new INT[part];
      copy( S.begin(), S.end(), C );
      out.write( (char *)C, part * sizeof( INT ) );   cout << "Pass: " << pass + 1 << endl;
      delete [] C;
      total += part;
      S.clear();
   }
   out.close();
   delete [] B;

   size = total;
   cout << "Size so far: " << size << endl;
}

//=====================================================================

int main()
{
   ifstream in( "data.txt" );
   vector< vector<INT> > A = readFile( in );
   cout << "Number of rows read = " << A.size() << endl;
   in.close();

   sort( A.begin(), A.end(), []( auto a, auto b ){ return a.size() > b.size(); } );      // tries to remove most duplicates first

   ofstream strm( "results.dat", ios::binary );
   INT zero = 0ULL;
   strm.write( (char *)&zero, sizeof( INT ) );
   strm.close();

   int size = 1;
   for ( int row = 0; row < A.size(); row++ )
   {
      cout << "\nIncluding row " << row + 1 << " ...\n";
      add( A[row], size );    
   }

   cout << "\nNumber of possible sequences: " << size << endl;
  
   // You will need to read and decode them
   int N = 10;
   vector<INT> V( N );
   ifstream fin( "results.dat", ios::binary );
   fin.read( (char *)V.data(), N * sizeof( INT ) );
   fin.close();

   cout << "The first " << N << " are:\n";
   for ( int i = 0; i < N; i++ ) print( decode( V[i] ) );
}


Number of rows read = 8

Including row 1 ...
Number of passes = 1
Pass: 1
Size so far: 27

Including row 2 ...
Number of passes = 1
Pass: 1
Size so far: 558

Including row 3 ...
Number of passes = 1
Pass: 1
Size so far: 8215

Including row 4 ...
Number of passes = 1
Pass: 1
Size so far: 117888

Including row 5 ...
Number of passes = 1
Pass: 1
Size so far: 1254015

Including row 6 ...
Number of passes = 3
Pass: 1
Pass: 2
Pass: 3
Size so far: 8490614

Including row 7 ...
Number of passes = 14
Pass: 1
Pass: 2
....             Lines removed
....
Pass: 13
Pass: 14
Size so far: 64099944

Including row 8 ...
Number of passes = 90
Pass: 1
Pass: 2
....             Lines removed
.... 
Pass: 89
Pass: 90
Size so far: 316370761

Number of possible sequences: 316370761
The first 10 are:
1  3  6  8  14  20  33  35  
2  5  7  8  14  20  33  35  
3  4  6  9  14  20  33  35  
2  3  8  9  14  20  33  35  
6  7  8  9  14  20  33  35  
3  5  8  10  14  20  33  35  
1  7  8  10  14  20  33  35  
4  7  9  10  14  20  33  35  
1  4  6  11  14  20  33  35  
1  2  8  11  14  20  33  35 

@lastchance, thanks a lot. It takes 12 mins to run on my work PC.
Topic archived. No new replies allowed.