advice needed on working with large data

Hi all,

I'd like to compare a small vector of ints against a much larger one according to some features. Is it better to copy sub-ranges out of the large vector each time until I find the desired result, or to work on the large vector in place and only update the indices?

Thanks
Can you please edit your post and remove the offending "everything made bold" thingie? Thanks.

About your question: Can you illustrate your problem a bit more, e.g. with some sample code? I don't quite get what you ask.

Ciao, Imi.
OK, given a vector of integers A (about 50 elements) and another vector B (about 16000 elements), I have to find the 50 elements in B that are most similar to A.

Which is better?
1- every iteration, I copy 50 elements from B to a new vector and compare it to A
2- or operate on B without making any copies and only update the indices into B
well, I can see how 2) would work, but how would you implement 1)?

So let's say you have a function like this:

// the bigger, the more similar both parameters are
int similar(int a, int b);

vector<int> A(50); 
// fill A
vector<int> B(16000);
// fill B 


Then your algorithm 2 is to loop over both vectors, calling "similar" about 50*16000 times, and cherry-pick the 50 highest numbers, right? (e.g. by keeping them in a sorted map with similarity as key and the position in B as value)

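// needs <map>, <vector>, <climits> and "using namespace std;"
multimap<int, vector<int>::iterator> result; // similarity -> position in B (multimap: scores may repeat)
vector<bool> used(B.size(), false);          // marks elements of B that were already picked
for (auto a = A.begin(); a != A.end(); ++a) {
  int best = INT_MIN;
  auto bestIt = B.end();                     // "nothing found yet"
  for (auto b = B.begin(); b != B.end(); ++b) {
    if (used[b - B.begin()]) continue;       // skip elements matched in an earlier round
    int sim = similar(*a, *b);
    if (best < sim) { bestIt = b; best = sim; }
  }
  if (bestIt == B.end()) break;              // B has no unused elements left
  used[bestIt - B.begin()] = true;
  result.insert({best, bestIt});
}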

Sounds pretty straightforward to me. (Code untested, just written to illustrate)


What does your algorithm 1 look like?

Ciao, Imi.
So, the first algorithm makes sense if there is already an implemented function that compares two arrays of 50 elements each. Then I would find an element in B equal to the first one in A, i.e. B[i] == A[0], copy the elements i...i+49 into a temporary array, and send it along with A to that function.

Whether it is faster than the second algorithm? Probably not, because the problem at hand can be solved by working with indices only, which is cheaper than copying.
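
To make the two options concrete, here is a minimal sketch of both, written to illustrate rather than as anyone's actual implementation. The similarity measure (negated sum of absolute differences) and the names similarity, best_window_by_copy and best_window_by_index are assumptions made up for this example. Unlike the description above, the sketch tries every start position in B instead of only those where B[i] equals the first element of A.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical similarity measure (an assumption, not from the thread):
// negated sum of absolute differences, so bigger means more similar
// and 0 is a perfect match. Both vectors must have the same length.
int similarity(const std::vector<int>& x, const std::vector<int>& y) {
    int score = 0;
    for (std::size_t k = 0; k < x.size(); ++k) score -= std::abs(x[k] - y[k]);
    return score;
}

// Option 1: copy each window of B into a temporary vector, then compare.
// Returns the start index of the best window (0 if B is shorter than A).
std::size_t best_window_by_copy(const std::vector<int>& A, const std::vector<int>& B) {
    std::size_t best_pos = 0;
    int best = 0;
    bool have_best = false;
    for (std::size_t i = 0; i + A.size() <= B.size(); ++i) {
        // One allocation and A.size() element copies per window.
        std::vector<int> tmp(B.begin() + i, B.begin() + i + A.size());
        int s = similarity(A, tmp);
        if (!have_best || s > best) { best = s; best_pos = i; have_best = true; }
    }
    return best_pos;
}

// Option 2: no copies, only an offset i into B.
std::size_t best_window_by_index(const std::vector<int>& A, const std::vector<int>& B) {
    std::size_t best_pos = 0;
    int best = 0;
    bool have_best = false;
    for (std::size_t i = 0; i + A.size() <= B.size(); ++i) {
        int s = 0;
        for (std::size_t k = 0; k < A.size(); ++k) s -= std::abs(A[k] - B[i + k]);
        if (!have_best || s > best) { best = s; best_pos = i; have_best = true; }
    }
    return best_pos;
}
```

Both functions do the same number of comparisons; option 1 only adds the temporary vector per window, which is the extra cost being discussed here.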
Whether it is faster than the second algorithm?

Better == faster?

I'd thought "better == easier to understand, maintain, teach, extend, write" :-D

Ciao, Imi.
So the first one is more meaningful and maintainable, but the second one is faster. I guess I have to decide now :D. I need this in a game, so I should pick the cheaper algorithm.
In fact, the idea isn't to get the 50 most similar integers individually, but to get the most similar region (50 elements in sequence).


ah, ok.. that makes much more sense then :-). Well, I recommend you write the "similar" function so that it takes iterators into a vector. Then you can "move" your search frame over the big array. This would be a variant of your 2), but using iterators instead of numerical indices.

typedef vector<int>::iterator IT;
int similar(const IT& start, const IT& end, IT reference)
{
  int score = 0; // the bigger, the more similar the two areas are
  for (auto b = start; b != end; ++b) // iterate over area in B
  {
    // compare *b with *reference and update score here. Maybe return early if a threshold is reached etc.. whatever...
    ++reference; // advance the iterator into the area in A
  }
  return score;
}

for (auto b = B.begin(); b <= B.end()-50; ++b) // move frame over B, including the last full frame
  ... similar( b, b+50, A.begin() ); ...


This way you don't need to copy anything around, which is surely more efficient than copying vectors all day long (your 1.). It's also how most of the STL works (the algorithms there), so it will be more familiar to people used to the STL than integer indices into B. You could even try to use some of the STL algorithms (std::accumulate springs to mind). Finally, it works with any kind of container, so you are not tied to vector. (Actually, "B.end()-50" needs a random access iterator, but you can easily replace this by keeping two variables in your for-loop).

Ciao, Imi.
PS: Oh, just for the record: Of course don't use -50 but -A.size() and make sure A is really smaller than B :-D
I really appreciate your effort. I will go with the second option. You also reminded me of the STL algorithms, thanks for bringing them to my attention.