8

Google couldn't help me so here it goes: which of the two selection algorithms, FloydRivest algorithm and Introselect, has better performance.

I'm assuming It's FloydRivest algorithm, but want to be 100% sure.

Also, if there exist even better algorithm for this purpose, I'd be glad to hear about them.

user2340939
  • 1,791
  • 2
  • 17
  • 44
  • 1
    If you have real data, get two implementations and test them. Both have similar big-O time and space complexities; however, two O(n) algorithms can still perform very differently depending on implementation and supplied data. That is why you will find few references to "X is best" - it depends. – tucuxi Apr 12 '15 at 18:01
  • @tucuxi I think you should expand your comment into an answer. – Anshul Goyal Apr 12 '15 at 18:04
  • @tucuxi Everything you said I know, that's why I'm asking this specific question, so I wouldn't need to waste time implementing both of them. Also, I'm interested in an average/real-world situation. – user2340939 Apr 12 '15 at 18:05
  • 2
    I left the comment so I wouldn't need to waste time implementing both of them - on my programming language, hardware and data, when you will be using it on something different. I doubt that you will find a >10% difference between two non-horrible implementations. Take one that is known and well tested, and use that. Optimize only if profiling shows this to be the bottleneck (which I really doubt). – tucuxi Apr 12 '15 at 19:53

1 Answers1

9

TLDR; I think Floyd-Rivest is better

I recently did some research on selection algorithms for a project I am working on. Here is a basic description of each algorithm:

  • Introselect: Performs a bipartitioning of the data, with a single pivot. Initially, a simple pivot (e.g. middle, median-of-3, etc) is chosen. The simple pivot is usually O(n^2) in the worst case, but O(n) on average. If the recursion level goes above a certain threshold, we fallback to a median-of-medians pivot. This is more costly to compute, but guarantees O(n) worst case.
  • Floyd-Rivest: Performs a quintary partition of the data, with two pivots. The two pivots are chosen so that the kth element lies between them, with high probability (this involves randomly sampling the data, and selecting two elements, through recursion, above and below what would be the nth-element). When the size of the partition becomes small enough, we select the pivots using a simpler method (e.g. median-of-3, etc)

As you can see, both are quite similar. Introselect starts with simple pivots, falling back to a complicated one; the Floyd-Rivest algorithm does just the opposite. The main difference is that introselect uses median-of-medians, whereas Floyd-Rivest uses a recursive sampling technique. So, I think a better comparison is median-of-medians vs Floyd-Rivest.

Which is better? From my research, it appears the hidden constants for Floyd-Rivest are smaller than median-of-medians. If I remember correctly, median-of-medians requires something like 5n comparisons (worst case), whereas Floyd-Rivest only needs 3.5n. Floyd-Rivest also uses a quintary scheme, which is better when the data can have lots of duplicates. Both introselect and Floyd-Rivest reduce to the same algorithm for small inputs, so you should get similar performance there (so long as you implement them the same). In my tests, Floyd-Rivest was 20%+ faster than all the other selection algorithms I tried. Though, I must admit, I did not test against a proper implementation of introselect that falls back to median-of-medians (I just tested the pseudo-introselect of libstdc++). In the original Floyd-Rivest paper, they themselves (who were co-authors of the median-of-medians approach) said median-of-medians "is hardly practical", and that the Floyd-Rivest algorithm was "probably the best practical choice".

So, it seems to me, Floyd-Rivest's pivoting technique is better than median-of-medians. You could implement introselect using Floyd-Rivest's pivoting, but then you might as well just do a pure Floyd-Rivest algorithm. I would recommend Floyd-Rivest as the best selection method.

Be warned! The original Floyd-Rivest paper gives an example implementation of their algorithm (this is the implementation listed on Wikipedia, at the time of writing this). However, this is a simplified version. From my tests, the simplified version is actually pretty slow! If you want a fast implementation, I think you'll need to implement the full algorithm. I recommend reading the paper "On Floyd and Rivest's SELECT algorithm" by Krzysztof C. Kiwiel. He gives a pretty good description of how to implement a fast Floyd-Rivest selection.

Azmisov
  • 6,493
  • 7
  • 53
  • 70
  • 1
    I don't believe Floyd-Rivest does a quintary partition, it is ternery which makes sense if there are 2 pivots. What it does say is it partition in quintiles. But I am quite sure, they are following Bentley & McIlroy's partition algorithm in the 1993 paper "Engineering a sort function". During the partition, you have 5 regions and 1 pivot value. Those 5 regions are 1) equal to pivot 2) less than pivot 3) unknown 4) greater than pivot 5) equal to pivot. After partition, regions 1 & 5 are moved to the middle, region 3 is now 0 elements. Net result : ternery partition – SJHowe Jul 02 '19 at 16:42
  • Yeah, technically the end result is a ternary partition. But you iteratively refine a quintary partition until it converges to the ternary partition. This quintary partitioning scheme is what gives the Floyd-rivest select its optimal computation complexity, which is why I pointed that out. – Azmisov Jul 02 '19 at 22:06