Fastest comparison algorithm based on a predefined container?

Pages: 123
My sentiments exactly, dhayden. Ironically I fell into the trap by using the phrase "selection sort". I never had an official course in CS that taught me sorting algorithms so I always get names mixed up.
> CS classes teach all kinds of different sorting algorithms and we learn that N*log(N) is the fastest that any sorting algo ran run

To be fair to CS classes, they also do talk about O(N) radix sorts and hash tables with amortised O(1) lookup.


In this case, we are not re-ordering a sequence which may contain duplicates (sort algorithm), we are re-ordering a set that contains unique values when we have a pre-created ordered superset (that also contains unique values).

The very same problem:
We have a dictionary of 200,000 words in English. This set of words does not change.
We need to do this many times: given an unordered subset of 1000 or so unique words, put them in dictionary order.

Here, the sort algorithm that was taught in the CS class or the Smac89 algorithm would be an order of magnitude faster than the dhayden algorithm. In conjunction with the Smac89 approach, use a radix sort based on the word position in the dictionary and the difference would be even more dramatic.
I assume that if we wanted to sort subset in reverse order of the master set and such and such, dhayden's method is still the fastest for large subsets (just iterate backwards instead)?

Even crazier, order subset according to the order:
first element of master < last element of master <
second element of master < second last element of master <
third element of master < third last element of master < ....

Then simply use dhyaden's method again, but iterate alternatingly as described above?
Last edited on
> dhayden's method is still the fastest for large subsets

Yes. With a stable master set, large being defined as: where (M/N) < log N

> Even crazier, order subset according to the order ... iterate alternatingly as described above?

Yes; the master needs to have bidirectional iterators.

> Then simply use dhyaden's method again

Yes; if the subset is 'large'.
Ok, I compared the 3 variants of sorting subset according to the master (without changing master itself). Someone may perhaps double-check my test #3, because the timing is pretty much the same as test #1 when I thought it should be slower:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
#include <iostream>
#include <string>
#include <random>
#include <ctime>
#include <algorithm>
#include <vector>
#include <cstdlib>
#include <unordered_map>
#include <unordered_set>

struct object
{
    std::string id ;
    object( std::string&& name = "" ) : id( std::move(name) ) {}
};

object random_object()
{
    static char alphabet[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" ;
    static constexpr std::size_t N = sizeof(alphabet) - 1 ;
    static std::mt19937 twister( std::time(nullptr) ) ;
    static std::uniform_int_distribution<std::size_t> distr( 2, 10 ) ;

    std::shuffle( alphabet, alphabet+N, twister ) ;
    return { { alphabet, alphabet + distr(twister) } } ;
}

int main()
{
    const std::size_t N = 1024 * 128 ;

    std::vector<object> master ;
    while( master.size() < N ) master.push_back( random_object() ) ;
    std::cout << "Master size: " << master.size() << '\n' ;

    std::srand( std::time(nullptr) ) ;
    std::vector< object > _subset;  // create random subset
    for( std::size_t j = 0 ; j < N ; j += 2 ) _subset.push_back( master[j] ) ;
    std::random_shuffle( _subset.begin(), _subset.end() ) ; // rearrange subset
    std::vector< object > subset[4] = {_subset, _subset, _subset, _subset};  // all 4 tests should use the same subset (test #4 is in the next post)
		
    {	// Test #1. Sort subset[0] according to order in master.
        const auto start = std::clock() ;
	// destructively move subset elements to a hash table
        std::unordered_set<std::string> set ;
        for( object& obj : subset[0] ) set.insert( std::move( obj.id ) ) ;
        std::vector< object > temp ; // scratch area
        // iterate through the master, look up, append to temp
        for( const object& obj : master )
        {
            auto iter = set.find(obj.id) ;
            if( iter != set.end() )
            {
                std::string mvalue = *iter ;
                temp.emplace_back( std::move(mvalue) ) ;
            }
        }
        subset[0].swap(temp) ; // copy temp to subset, discard subset
        const auto end = std::clock() ;
        std::cout << "Test #1.  Subset size: " << subset[0].size() << ".  Result: " << ( end - start ) / double(CLOCKS_PER_SEC) << " seconds.\n" ;
    }
    {	// Test #2.  Sort subset[1] according to reverse order in master.
        const auto start = std::clock() ;
        std::unordered_set<std::string> set ;
        for( object& obj : subset[1] ) set.insert( std::move( obj.id ) ) ;
        std::vector< object > temp ;
        for(std::vector<object>::const_reverse_iterator it = master.crbegin(); it != master.crend(); ++it)
        {
            auto iter = set.find(it->id) ;
            if( iter != set.end() )
            {
                std::string mvalue = *iter ;
                temp.emplace_back( std::move(mvalue) ) ;
            }
        }
        subset[1].swap(temp) ;
        const auto end = std::clock() ;
        std::cout << "Test #2.  Subset size: " << subset[1].size() << ".  Result: " << ( end - start ) / double(CLOCKS_PER_SEC) << " seconds.\n" ;
    }
    {	// Test #3.  Sort subset[2] according to master[0] < master[N-1] < master[1] < master[N-2] < master[2] < master[N-3] < master[3] < ...
        const auto start = std::clock() ;
        std::unordered_set<std::string> set ;
        for( object& obj : subset[2] ) set.insert( std::move( obj.id ) ) ;
        std::vector< object > temp ;
        std::vector<object>::const_iterator it = master.begin(),  it_halfway = std::next (master.begin(), N/2);
        std::vector<object>::const_reverse_iterator jt = master.crbegin(),  jt_halfway = std::next (master.crbegin(), N/2);
	while (it != it_halfway)  // while (it != master.end()) is wrong
	{
		while (jt != jt_halfway)  // while (jt != master.crend()) is wrong
	        {
	            auto iter = set.find(it->id) ;
	            if( iter != set.end() )
	            {
	                std::string mvalue = *iter ;
	                temp.emplace_back( std::move(mvalue) ) ;
	            }
	            iter = set.find(jt->id) ;
	            if( iter != set.end() )
	            {
	                std::string mvalue = *iter ;
	                temp.emplace_back( std::move(mvalue) ) ;
	            }
	            ++it;  ++jt;
	        }
	}
        subset[2].swap(temp) ;
        const auto end = std::clock() ;
        std::cout << "Test #3.  Subset size: " << subset[2].size() << ".  Result: " << ( end - start ) / double(CLOCKS_PER_SEC) << " seconds.\n" ;
    }
}
/*
Output:
Master size: 131072
Test #1.  Subset size: 72841.  Result: 0.063 seconds.
Test #2.  Subset size: 72841.  Result: 0.078 seconds.
Test #3.  Subset size: 72841.  Result: 0.063 seconds.
*/


Actually, I think test #3 can be simplified a bit more with
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
{	// Test #3.  Sort subset[2] according to master[0] < master[N-1] < master[1] < master[N-2] < master[2] < master[N-3] < master[3] < ...
        const auto start = std::clock() ;
        std::unordered_set<std::string> set ;
        for( object& obj : subset[2] ) set.insert( std::move( obj.id ) ) ;
        std::vector< object > temp ;
        std::vector<object>::const_iterator it = master.begin();
        std::vector<object>::const_reverse_iterator jt = master.crbegin();
        std::size_t count = 0;
	while (count < N/2)
        {
            auto iter = set.find(it->id) ;
            if( iter != set.end() )
            {
                std::string mvalue = *iter ;
                temp.emplace_back( std::move(mvalue) ) ;
            }
            iter = set.find(jt->id) ;
            if( iter != set.end() )
            {
                std::string mvalue = *iter ;
                temp.emplace_back( std::move(mvalue) ) ;
            }
            ++it;  ++jt;  ++count;
	}
        subset[2].swap(temp) ;
        const auto end = std::clock() ;
        std::cout << "Test #3.  Subset size: " << subset[2].size() << ".  Result: " << ( end - start ) / double(CLOCKS_PER_SEC) << " seconds.\n" ;  // 0.062
}

and it still gives the same timing result as test #1.
Last edited on
And finally, I can't let this thread go without looking into the most general variant (without changing master itself):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
{	// Test #4.  Sort subset[3] according to master[n0] < master[n1] < master[n2] < master[n3] < ..., where {n0,n1,n2,...} is any permutation of {0,1,2,...,N-1}.
	std::vector<int> indices(N);
	std::iota (indices.begin(), indices.end(), 0);
	std::random_shuffle (indices.begin(), indices.end());  // indices is the permutation of  {0,1,2,...,N-1} that will define the ordering of subset[3].

        const auto start = std::clock() ;
        std::unordered_set<std::string> set ;
        for( object& obj : subset[3] ) set.insert( std::move( obj.id ) ) ;
        std::vector< object > temp ;
        for (int x : indices) 
        {
            auto iter = set.find(std::next(master.begin(), x)->id) ;
            if( iter != set.end() )
            {
                std::string mvalue = *iter ;
                temp.emplace_back( std::move(mvalue) ) ;
            }
	}
        subset[3].swap(temp) ;
        const auto end = std::clock() ;
        std::cout << "Test #4.  Subset size: " << subset[3].size() << ".  Result: " << ( end - start ) / double(CLOCKS_PER_SEC) << " seconds.\n" ;
}

Master size: 131072
Test #1.  Subset size: 72853.  Result: 0.079 seconds.
Test #2.  Subset size: 72853.  Result: 0.078 seconds.
Test #3.  Subset size: 72853.  Result: 0.063 seconds.
Test #4.  Subset size: 72853.  Result: 0.079 seconds. 


Unless my code for test #3 and test #4 are wrong, generalizing seems to show no signs of performance loss (I thought set.find(std::next(master.begin(), x)->id) would slow things down a lot).
Last edited on
LB wrote:

...I never had an official course in CS that taught me sorting algorithms so I always get names mixed up.


Online class:
Part1: Started Sept 5th
https://www.coursera.org/course/algs4partI

Part2: Next session starts October 31st
https://www.coursera.org/course/algs4partII

I did these courses over this past summer and they are lot of fun! It does require atleast a beginner experience with java for the first part and you should be good for second part after doing the first. It is fun, free, and you will come out of it with a very good knowledge of sorting algorithms.

Btw for anyone interested, there are more courses on that website that deal with algorithms.

P.S. there is even one for android programming...
https://www.coursera.org/course/android
http://lmgtfy.com/?q=define+ps
Last edited on
Why is the course free?
> Actually, I think test #3 can be simplified a bit more with ...

Or with:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
        auto it = master.begin() ;
        auto jt = master.crbegin() ;
        while ( it < jt.base() )  
	    {
	        {
	            auto iter = set.find(it->id) ;
	            if( iter != set.end() )
	            {
	                std::string mvalue = *iter ;
	                temp.emplace_back( std::move(mvalue) ) ;
	            }
	            iter = set.find(jt->id) ;
	            if( iter != set.end() )
	            {
	                std::string mvalue = *iter ;
	                temp.emplace_back( std::move(mvalue) ) ;
	            }
	            ++it;  ++jt;
	        }
	    }

Both relying on the knowledge that N is even.

All three should perform about equally; #3 would be a wee bit faster because the loop is unrolled.
It is a quality of implementation issue, but #2, #3 (partly) may suffer if the increment of the reverse random access iterator is not fully optimized.
http://coliru.stacked-crooked.com/a/600a3d84d45cbb3f

> (I thought set.find(std::next(master.begin(), x)->id) would slow things down a lot).

It will, if a [tt]std::list<>[//tt] holds the master set.
Last edited on
prestokeys wrote:
Why is the course free?


Enjoy it while it lasts. These are real courses from some top universities so take advantage while you can
Topic archived. No new replies allowed.
Pages: 123