Fastest comparison algorithm based on a predefined container?

Pages: 123
Yikes! I missed that. Ok, I changed my code above so that v is reset back to subset after each sort. The move constructor calls affect the performance, but all the tests are affected in the same way, so each test loses the same amount of time, about 0.1 seconds. The new results are posted, using my Pentium i7 4770, Windows 8.1 64-bit, 2 GB RAM, GCC 4.8.1. I also don't get why map is faster than unordered_map, but the lack of a hash function is certainly a big factor.
From the links I posted, try using one of the hash functions found there for unordered_map and see if that improves performance. For std::map, that is as good as it gets using that method.
Smac89's algorithm can be improved by recognizing that you only need to map the subset, not the master set:

Put the subset items in a content-addressable collection like a set or a hash table.

n = 0
for each element in the master set {
    if the element is found in the collection then {
        it is the n'th item
        ++n
    }
}

If this has to be done many times against the same master sequence:

#include <iostream>
#include <string>
#include <random>
#include <ctime>
#include <algorithm>
#include <vector>
#include <cstdlib>
#include <unordered_map>

struct object
{
    std::string id ;
    object( std::string&& name = "" ) : id( std::move(name) ) {}
};

object random_object()
{
    static char alphabet[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" ;
    static constexpr std::size_t N = sizeof(alphabet) - 1 ;
    static std::mt19937 twister( std::time(nullptr) ) ;
    static std::uniform_int_distribution<std::size_t> distr( 2, 10 ) ;

    std::shuffle( alphabet, alphabet+N, twister ) ;
    return { { alphabet, alphabet + distr(twister) } } ;
}

int main()
{
    const std::size_t N = 1024 * 128 ;

    std::vector<object> master ;
    while( master.size() < N ) master.push_back( random_object() ) ;
    std::cout << "master size: " << master.size() << '\n' ;

    std::unordered_map< std::string, std::size_t > rank ;
    for( std::size_t i = 0 ; i < N ; ++i ) rank.emplace( master[i].id, i ) ;

    constexpr int REPEAT = 5 ;
    std::srand( std::time(nullptr) ) ;
    std::vector< object > subset ;
    for( int i = 0 ; i < REPEAT ; ++i )
    {
        // create random subset
        subset.clear() ;
        for( std::size_t j = i ; j < N ; j +=  2 + i*2 ) subset.push_back( master[j] ) ;
        std::random_shuffle( subset.begin(), subset.end() ) ;

        // rearrange subset
        const auto start = std::clock() ;
        std::sort( subset.begin(), subset.end(),
                   [&rank] ( const object& a, const object& b ) { return rank[a.id] < rank[b.id] ; } ) ;
        const auto end = std::clock() ;
        std::cout << "subset size: " << subset.size() << ' ' << ( end - start ) / double(CLOCKS_PER_SEC) << " seconds.\n" ;
    }
}

clang++ -std=c++11 -stdlib=libc++ -Wall -Wextra -pedantic-errors -O3 main.cpp -lsupc++ -o test && ./test
echo --------------- && g++-4.9 -std=c++11 -Wall -Wextra -pedantic-errors -O3 main.cpp -o test && ./test
master size: 131072
subset size: 65536 0.44 seconds.
subset size: 32768 0.19 seconds.
subset size: 21845 0.11 seconds.
subset size: 16384 0.09 seconds.
subset size: 13107 0.06 seconds.
---------------
master size: 131072
subset size: 65536 0.43 seconds.
subset size: 32768 0.19 seconds.
subset size: 21845 0.12 seconds.
subset size: 16384 0.09 seconds.
subset size: 13107 0.07 seconds.

http://coliru.stacked-crooked.com/a/ea2dfa712a2d6d69


> Pentium i7 4770 Windows 8.1 64-bit, 2GB RAM

You need more memory (for Windows 8.1 64-bit, 8 GB or so appears to be the sweet spot for normal usage).

On this laptop (Windows 8.1 64-bit, Pentium i5 3320M (2.6 GHz), 16 GB), GCC 4.9 (MinGW, -O3):
master size: 131072
subset size: 65536 1.222 seconds.
subset size: 32768 0.551 seconds.
subset size: 21845 0.354 seconds.
subset size: 16384 0.26 seconds.
subset size: 13107 0.195 seconds. 
@dhayden
Your suggestion is good, but unless your subset is a symbol table, the complexity of the problem is now O(M * N), where M is the size of the master list and N is the size of the subset.

Now if you do plan to make the subset a symbol table, then using your idea it will be impossible to rearrange the items in the subset based on their position in the master list, because you will be invalidating the mappings.

The next solution would be to build a new subset based on the information you have, but this means the memory requirements of the program will increase.
> the complexity of the problem is now at M * N, where M is the size of the master list and N is size of subset

The time complexity is: hashmap subset O(N) + lookup for each in master O(M) + sort subset O(N log N)
== O( N log N )

In the other case, it is: hashmap master O(M) + look up for each in subset O(N) + sort subset O( N log N )
== O( N log N )

In either case, unless M is very large and N is very small ( M/N is larger than log N) , the time to sort would dominate.
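Written out (with M the master size and N the subset size), both variants have the same asymptotic cost; the O(N log N) conclusion holds whenever the sort term dominates the linear scan:

```latex
T_{\text{map master}} = O(M) + O(N) + O(N \log N) = O(M + N \log N)
T_{\text{map subset}} = O(N) + O(M) + O(N \log N) = O(M + N \log N)
```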
@Smac89
unless your subset is a symbol table

As I said in my post, put the subset in a content-addressable collection like a set or a hash table.

if you do plan to make the subset a symbol table, then using your idea, it will be impossible to rearrange the items in the subset based on their position

A hash table of pointers to the items, or (pointer, sorted position) pairs, would work.

The next solution will be to build a new subset based on the information you have, but this will mean the memory requirements of the program will increase

Your original algo required a map of the master list with an integer. If M and S are the number of items in the master and subset collections, then yours requires k1*M extra memory and mine requires k2*S. The one with the least memory depends on the values of k1, k2, M and S.

@JLBorges
The time complexity is: hashmap subset O(N) + lookup for each in master O(M) + sort subset O(N log N) == O( N log N )

My algorithm isn't a comparison-based sort, so the O(N log N) addend isn't there.
> so the O(N log N) addend isn't there.

How do you intend to rearrange the original subset?

EDIT:

Ok, I think I understood your algorithm now.

Master set: Smaster
Original subset: Soriginal
Map: Smap
Ordered copy of subset: Stemp

1. Insert elements in Soriginal into Smap
2. Iterate through Smaster; if found in Smap write to next position in Stemp
3. Copy Stemp to Soriginal
4. Discard Stemp.
The elevation of this discussion is very theoretical and out of my reach (even though I teach math). But do we at least all agree that the fastest algorithm is to use std::unordered_map to set the rank of the elements in the master set? In particular, the naïve OP method of simply iterating from the beginning until the first of the two elements being compared is found is only effective if the subset is small, and ineffective if it is on the order of hundreds or thousands of elements?
Master set: Smaster
Original subset: Soriginal
Map: Smap
Ordered copy of subset: Stemp

1. Insert elements in Soriginal into Smap
2. Iterate through Smaster; if found in Smap write to next position in Stemp
3. Copy Stemp to Soriginal
4. Discard Stemp.

I would be very excited to put this into code and benchmark it against JLBorges' test, once I understand it. I don't think this thread is finished until we (or at least I) have fully considered dhayden's alternative.

Ok, so for example,
master = {a,b,c,d,e,f,g,h,i,j}
subset = {e,a,i}
Then map = { {e,5}, {a,1}, {i,9} }
Now sort according to the map (which only has 3 elements, and hence the search is faster). 1 < 5 < 9, so a < e < i. Is that the intention?
I believe the change needed to JLBorges' test to try out dhayden's method is simply to move his rank unordered_map definition forward to the point right before the timing begins (before the 'start' definition) and redefine it to:
std::unordered_map< std::string, std::size_t > rank ;
for( std::size_t i = 0 ; i < N ; ++i ) {
    if (std::find(subset.begin(), subset.end(), master[i]) != subset.end())
        rank.emplace( master[i].id, i ) ;
}

But it won't compile, because GCC complains about object and const object being the operands of ==.
I'm probably not understanding the situation, but it sounds like something for std::set_intersection. http://ideone.com/DAmPmE
dhayden's method:

#include <iostream>
#include <string>
#include <random>
#include <ctime>
#include <algorithm>
#include <vector>
#include <cstdlib>
#include <unordered_map>
#include <unordered_set>

struct object
{
    std::string id ;
    object( std::string&& name = "" ) : id( std::move(name) ) {}
};

object random_object()
{
    static char alphabet[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" ;
    static constexpr std::size_t N = sizeof(alphabet) - 1 ;
    static std::mt19937 twister( std::time(nullptr) ) ;
    static std::uniform_int_distribution<std::size_t> distr( 2, 10 ) ;

    std::shuffle( alphabet, alphabet+N, twister ) ;
    return { { alphabet, alphabet + distr(twister) } } ;
}

int main()
{
    const std::size_t N = 1024 * 128 ;

    std::vector<object> master ;
    while( master.size() < N ) master.push_back( random_object() ) ;
    std::cout << "master size: " << master.size() << '\n' ;

    constexpr int REPEAT = 5 ;
    std::srand( std::time(nullptr) ) ;
    std::vector< object > subset ;
    for( int i = 0 ; i < REPEAT ; ++i )
    {
        // create random subset
        subset.clear() ;
        for( std::size_t j = i ; j < N ; j +=  2 + i*2 ) subset.push_back( master[j] ) ;
        std::random_shuffle( subset.begin(), subset.end() ) ; // rearrange subset

        const auto start = std::clock() ;

        // destructively move subset elements to a hash table
        std::unordered_set<std::string> set ;
        for( object& obj : subset ) set.insert( std::move( obj.id ) ) ;

        std::vector< object > temp ; // scratch area

        // iterate through the master, look up, append to temp
        for( const object& obj : master )
        {
            auto iter = set.find(obj.id) ;
            if( iter != set.end() )
            {
                std::string mvalue = *iter ;
                temp.emplace_back( std::move(mvalue) ) ;
            }
        }

        subset.swap(temp) ; // swap temp into subset; the old subset contents are discarded with temp

        const auto end = std::clock() ;
        std::cout << "subset size: " << subset.size() << ' ' << ( end - start ) / double(CLOCKS_PER_SEC) << " seconds.\n" ;
    }
}

echo --------------- && g++-4.9 -std=c++11 -Wall -Wextra -pedantic-errors -O3 main.cpp -o test && ./test
master size: 131072
subset size: 72826 0.06 seconds.
subset size: 41196 0.05 seconds.
subset size: 29312 0.04 seconds.
subset size: 22929 0.03 seconds.
subset size: 18436 0.04 seconds.
---------------
master size: 131072
subset size: 72749 0.07 seconds.
subset size: 41092 0.04 seconds.
subset size: 29535 0.03 seconds.
subset size: 22821 0.02 seconds.
subset size: 18905 0.03 seconds.

http://coliru.stacked-crooked.com/a/9ef12d230659d8cf
Amazing speed-up, up to a factor of 10 over mine.
It looks like LB's intuition was correct right from the start, though he never stated dhayden's approach. We don't need to implement any comparison function at all.
The most recent code by JLBorges is exactly what I suggested ;p
@LB. When you said:

 
If you use selection sort you do not need a comparator. Just construct a new container by pulling elements from the old one in the order of the template.


I didn't know that you meant this, and interpreted it differently, as shown in my tests. I admit, you are quite a genius. Your comments are so brief that it's hard to understand you sometimes (it happened a few months ago too, in another thread of mine, when you said "Just use std::is_enum", when in the end the full solution was to write a complex algorithm using std::enable_if and this and that ... and somewhere in the middle is std::is_enum).
Well, the reason you don't need a comparison function is that he wasn't hashing the object itself, just the string inside the object. Strings have a built-in comparison mechanism, and std::hash works for them too. Btw, std::hash is what std::unordered_set and std::unordered_map use to hash keys to a bucket.

http://en.cppreference.com/w/cpp/utility/hash
> If you use selection sort you do not need a comparator.
> Just construct a new container by pulling elements from the old one in the order of the template.

Selection sort is an in-place comparison sort. This is not a selection sort.

In addition to the hash function, this too requires a comparator - the members of the set must be EqualityComparable.
I don't know why I said selection sort, I meant something completely different. My apologies.
This points out a deep flaw in many people's understanding of algorithms. CS classes teach all kinds of different sorting algorithms, and we learn that N*log(N) is the fastest that any sorting algorithm can run, but many people forget the initial premise for all of them: the only things you can do are compare items and move them. Few problems are really this limited. In this example, we can deduce the order from another data structure.

So when looking at a problem, don't think "hey, this is an instance of problem XYZ that I learned about in school." Instead, think about how this problem is different from XYZ. You may find that the difference can be exploited with tremendous results.

My best real-world example came years ago when we had a stream of objects and we needed to know if each one was in a set of around 20 million. One guy did an analysis and said we couldn't possibly do it without some huge amount of database hardware. He assumed we had to check the database for each object.

I completely destroyed his argument by pointing out that most objects would NOT be in the database, and if we optimized for that case, performance would be fine. The result was an in-memory data structure that is small, very fast, and gives a "maybe or no" answer instead of a "yes or no" one. In other words, if the structure says "no" then you're guaranteed the object isn't in the database. But if the structure says "maybe" then you have to query the DB for a definitive answer. You can tweak the probability that a "maybe" will ultimately result in "yes", and we have that set to 999/1000. In other words, if we have to go to the expense of checking the database, there's a 99.9% probability that we will find the object there. This is still working now, years later, with the database at about 150 million items.

You can sum it up with this:
solve the right problem.