IntroSort explanation

IntroSort is a hybrid algorithm that starts with QuickSort and switches to HeapSort.
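
For reference, the basic shape of the algorithm is roughly this -- a minimal single-threaded sketch, not my actual code; the cutoff of 16 and the last-element pivot are arbitrary simplifications, and it assumes random-access iterators:

    #include <algorithm>
    #include <cmath>

    template <typename Iter>
    void insertion_sort(Iter first, Iter last)
    {
        if (first == last) return;
        for (Iter i = first + 1; i != last; ++i)
            for (Iter j = i; j != first && *j < *(j - 1); --j)
                std::iter_swap(j, j - 1);
    }

    template <typename Iter>
    void introsort_impl(Iter first, Iter last, int depth_limit)
    {
        while (last - first > 16)               // small ranges handled below
        {
            if (depth_limit == 0)               // quicksort has gone too deep:
            {
                std::make_heap(first, last);    // finish this range with heapsort
                std::sort_heap(first, last);
                return;
            }
            --depth_limit;

            // last-element pivot for brevity; median-of-3 is better in practice
            auto pivot = *(last - 1);
            Iter mid = std::partition(first, last - 1,
                                      [&](const auto& v) { return v < pivot; });
            std::iter_swap(mid, last - 1);

            introsort_impl(mid + 1, last, depth_limit);   // recurse on one side,
            last = mid;                                   // loop on the other
        }
        insertion_sort(first, last);
    }

    template <typename Iter>
    void introsort(Iter first, Iter last)
    {
        if (last - first > 1)
            introsort_impl(first, last,
                           2 * static_cast<int>(std::log2(last - first)));  // Musser's 2*log2(n)
    }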

I've made a multi-threaded IntroSort, and on average it sorts 100 million random floats in 27 seconds on my 4-core 1.5 GHz processor, or 10 million in 2400 milliseconds. That's the fastest of all the algorithms I've written so far.

I've run some tests and noticed that it performs best when it switches to heapsort after log2(n) iterations of quicksort, although I've heard the recommendation is to switch after 2*log2(n).
My question is: why switch based on the number of iterations, and why is this combination better (for random numbers) than using just quicksort or just heapsort?

Also, what do you think about the execution time (considering my processor speed and number of cores)?
I'm sure I could make it faster, but this seems pretty fast as it is.

SOLUTION:
So this is what I think is happening: when quicksort partitions the array badly (pushing the complexity above n log n), the shorter part gets sorted quickly by quicksort, while the longer part for some reason keeps getting bad partitions and slows the whole sort down. Switching to heapsort after log n iterations ensures that the badly partitioned part of the array gets sorted faster, by heapsort.
Switching to heapsort too early probably wastes quicksort's efficiency, and switching too late probably lets quicksort keep partitioning the array badly for too long, so I found that switching to heapsort after log2(N) iterations of quicksort is optimal for this process.
I'm not sure how you managed to implement a multi-threaded introsort without knowing the answer to your question.

Multi-threading only makes a difference if each thread is actually running on a different core. Further, it's a red herring: the efficiency of the algorithm does not change -- you are simply (assuming you are using all four cores) performing some of the sorting in parallel, which reduces the overall time needed to sort, but not the overall amount of work.

Musser's paper is a really easy read, and he explains quite simply the purpose of the switch:

- a quicksort typically executes faster than a heapsort
- except when it has bad data, in which case its execution becomes really inefficient
- heapsort cannot be made inefficient by bad data

Musser noticed that the cutoff is easily calculated by the number of subproblems generated by a quicksort -- if it branches down to more than 2*floor(log2(n)) recursions then the quicksort becomes less efficient than a heapsort, so rather than continue it just switches to heapsort.

In other words, if the quicksort has to switch to a heapsort, it is bottoming out at its worst performance.

What I suspect is happening (assuming you are successfully partitioning the data across all four cores) is that you are computing the recursion depth limit based upon N, instead of N/4. In other words, you've broken the efficiency switch with whatever you've done to make the algorithm work multi-core.

Remember, 2*log2(n/4) = 2*(log2(n) - log2(4)) = 2*(log2(n) - 2) = 2*log2(n) - 4

So with four cores each thread is only sorting n/4 elements, and its worst-case efficiency switch should sit at 2*log2(n) - 4 -- a limit computed from the per-thread size, not from the full n.
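
To put numbers on it, using the 100 million floats from the first post split across 4 cores (my arithmetic, not figures from the thread):

    floor(log2(100,000,000)) = 26,  so 2*floor(log2(n))   = 52
    floor(log2( 25,000,000)) = 24,  so 2*floor(log2(n/4)) = 48

Either way, the log2(n) = 26 cutoff reported in the first post is well below Musser's recommendation.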

Your introsort is performing sub-optimally.


Keep in mind as well:
- quicksort has nice locality of reference
- heapsort never does

Don't feel too bad. Introsort is tricky to get right. Fix your errors and you should have a very efficient algorithm.

As a matter of (rhetorical) interest:
Your multi-threaded algorithm works by
(1) determining whether your data is actually large enough to need multiple cores, and how many it would like
(2) identifying the actual number of cores it can use
(3) performing the first 1 or 2 iterations of quicksort to split the data into the requisite number of partitions (== number of cores to be used)
(4) initiating a thread for each core that runs a normal introsort on its data partition
Correct? (Roughly like the sketch below, in other words.)
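
Something like this, I mean -- a rough sketch only, with made-up names: partition_step() stands in for one quicksort partition pass that returns the split point, and introsort() for a single-threaded introsort like the one sketched in the first post.

    #include <algorithm>
    #include <thread>
    #include <utility>
    #include <vector>

    template <typename Iter>
    void parallel_introsort(Iter first, Iter last)
    {
        const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        if (cores == 1 || last - first < 100000)    // arbitrary "too small" cutoff
        {
            introsort(first, last);
            return;
        }

        // each partition pass doubles the number of chunks: 1 -> 2 -> 4 -> ...
        std::vector<std::pair<Iter, Iter>> chunks{ {first, last} };
        while (chunks.size() < cores)
        {
            std::vector<std::pair<Iter, Iter>> next;
            for (const auto& c : chunks)
            {
                Iter mid = partition_step(c.first, c.second);   // hypothetical helper
                next.push_back({c.first, mid});
                next.push_back({mid, c.second});
            }
            chunks.swap(next);
        }

        std::vector<std::thread> workers;           // one introsort per chunk
        for (const auto& c : chunks)
            workers.emplace_back([c] { introsort(c.first, c.second); });
        for (auto& t : workers)
            t.join();
    }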

Oh, also, I hope you aren't actually using std::log2() to calculate the maximum recursion depth. And that you are using the deferred insertion sort at the end.
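
By "deferred insertion sort" I mean Musser's trick of leaving every small range unsorted during the recursion and running one insertion-sort pass over the whole array at the very end. Roughly this shape, where introsort_loop() is assumed to be the usual recursion minus its per-range insertion sort:

    // After the recursion, every element is within a few slots of its final
    // position, so the single deferred insertion-sort pass is nearly linear.
    template <typename Iter>
    void introsort_deferred(Iter first, Iter last, int depth_limit)
    {
        introsort_loop(first, last, depth_limit);   // quicksort/heapsort only
        insertion_sort(first, last);                // one deferred pass at the end
    }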

Hope this helps.

[edit] Sorry, forgot this: http://www.cplusplus.com/faq/sequences/sequencing/sort-algorithms/introsort/
My question wasn't about multi-threading, and the threading isn't relevant to the question.
But if you're interested: it splits 8 times (if the step has more than a certain length to process, plus some other limits), because partitioning can never be perfectly even, so this ensures there are always threads executing on all cores. (By the way, I use median-of-3 when partitioning.)
And I have my own function for calculating log2, written specifically for integers.
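
Something along these lines, just to illustrate the idea (not my actual function):

    #include <cstddef>

    // floor(log2(n)) for integers, without calling std::log2()
    inline int ilog2(std::size_t n)
    {
        int result = 0;
        while (n >>= 1)     // halve until nothing is left
            ++result;
        return result;      // ilog2(1) == 0, ilog2(8) == 3, ilog2(9) == 3
    }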

The thing I don't understand is why to switch after log2(n) iterations. Why not switch once the remaining length drops below a certain size?
I'm not so sure you've read my answer carefully.

THREADING
First, the threading most definitely is relevant, just not for the reason I think you think it is.

Unless you are actually going to be sorting enough objects in parallel (meaning, actually on separate CPUs) to offset the cost of creating a new thread, it is not worth it.

Creating multiple threads to sort a single sequence on the same core is never worth the overhead cost of creating the thread -- because only one thread can operate on a single core at a time. In other words, the time to sort data using one thread and the time to sort data using two threads are related thus:

    one thread: T(sort n objects)
    two threads: T(sort n/2 objects) + T(create and destroy new thread) + T(sort n/2 objects)

That simplifies:

    one thread: T(sort n objects)
    two threads: T(sort n objects) + T(create and destroy new thread)
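
If you want to put a number on that T(create and destroy new thread) term, something like this gives a rough figure (an illustration only -- results vary a lot by OS and hardware):

    #include <chrono>
    #include <iostream>
    #include <thread>

    int main()
    {
        using clock = std::chrono::steady_clock;
        const int runs = 1000;

        const auto start = clock::now();
        for (int i = 0; i < runs; ++i)
        {
            std::thread t([] {});   // a thread that does no work at all,
            t.join();               // so only creation + destruction is measured
        }
        const auto elapsed = clock::now() - start;

        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count() / runs
                  << " microseconds per thread create/join\n";
    }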

I'm not sure what you mean by 'the partitioning never being even'. You can always partition to use all available cores, even if there are an odd number of cores. (There's no rule that says the top-level partition need be bipartite. Or even that you give the same number of elements to each core to sort.)

MAX-DEPTH
I explained my suspicion quite clearly, but you have yet to confirm or refute my hypothesis.

What's n?

Is it the total number of items being sorted by all threads? (If it is, your algorithm is twice as inefficient as it should be.)

Is it the number of items being sorted by a single thread? (If it is, your algorithm is four times as efficient as is actually possible.)

There is something else going on, but without seeing your code I can only speculate on what that is.
I'm really starting to suspect your competence regarding threading...

But I see you have enough time to write such a long post, so why don't you invest some of that time in writing a threaded introsort, and tell me what you think after that?

I'd rather talk with someone who has experimented with algorithms like this.
It seems to me that if quicksort partitions the data exactly in half each time, then after log2(n) iterations it will have 1 item in the partition. So it makes sense that after that many iterations you might want to switch to heapsort.
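
For example, with the 10 million items from the first post:

    log2(10,000,000) ~ 23.3    (2^23 = 8,388,608 and 2^24 = 16,777,216)

so about 24 perfectly even splits would take 10 million elements down to single-element partitions.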

In your original post, it seems that sorting 10x the numbers takes just over 10x the time, which is good. As for the absolute numbers, I don't know if they are high or low, but you could compare to std::sort() or qsort().
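
A minimal harness for that comparison could look something like this (the size and seed are arbitrary, and my_sort is just a placeholder for your own function):

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    int main()
    {
        std::mt19937 gen(12345);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        std::vector<float> data(10'000'000);
        for (auto& x : data) x = dist(gen);

        std::vector<float> copy = data;     // identical input for both runs
        (void)copy;                         // unused until my_sort is plugged in

        using clock = std::chrono::steady_clock;
        const auto t0 = clock::now();
        std::sort(data.begin(), data.end());
        const auto t1 = clock::now();
        std::cout << "std::sort: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";

        // const auto t2 = clock::now();
        // my_sort(copy.begin(), copy.end());   // your sort here, on the same data
        // const auto t3 = clock::now();
    }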

Have you looked at radix sort? For a large number of integers it might be faster than a comparison-based sort.
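
For reference, a bare-bones LSD radix sort for 32-bit unsigned integers has this shape (an illustration only -- sorting floats or signed integers this way needs the keys remapped first):

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    void radix_sort(std::vector<std::uint32_t>& a)
    {
        std::vector<std::uint32_t> tmp(a.size());
        for (int shift = 0; shift < 32; shift += 8)     // one byte per pass
        {
            std::array<std::size_t, 257> count{};       // counting sort by that byte
            for (std::uint32_t v : a)
                ++count[((v >> shift) & 0xFF) + 1];
            for (int i = 1; i < 257; ++i)
                count[i] += count[i - 1];               // prefix sums = start offsets
            for (std::uint32_t v : a)
                tmp[count[(v >> shift) & 0xFF]++] = v;  // stable scatter
            a.swap(tmp);
        }
    }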
This may be of interest:

Spreadsort: http://en.wikipedia.org/wiki/Spreadsort

The Sort library (hybrid sorting algorithms based on spreadsort, by Steven Ross, the inventor of spreadsort) was recently reviewed and accepted into Boost.
http://eldiener.github.io/sort/doc/overview.html
dhayden, I think you meant to say that if the partitioning is bad and you have more than one element left after log2(n) iterations, then switching to heapsort speeds it up.
But the question is: why does this speed it up?

And I have compared my quicksort algorithm (not the introsort) to std::sort() -- talking about single-threaded algorithms now -- and the standard one crashed for arrays longer than 100,000 for some reason. For the sizes where it did work, I noticed it was about twice as slow as my version, and its execution kept slowing down (relative to mine) as the size increased, so I suspect std::sort() doesn't use a median to choose the pivot and therefore doesn't partition the array well. So I'm satisfied with my version. I didn't try std::qsort, though.
And my introsort is much faster than my quicksort (especially when compared for lengths above 1 million).

And I'm afraid I can't use radix sort, as my algorithm has to work with both integers and floats.
I used a template when coding it, so you can call it with whatever type of data you want.

JLBorges, sorry, but you're off-topic.
I've posted my opinion in the original post. Feel free to comment on it if you don't agree.