Top K smallest selection algorithm - O (n + k log n) vs O (n log k) for k << N

Question

I'm asking this in regards to Top K algorithm. I'd think that O(n + k log n) should be faster, because well.. for instance if you try plugging in k = 300 and n = 100000000 for example, we can see that O(n + k log n) is smaller.

However when I do a benchmark with C++, it's showing me that O (n log k) is more than 2x faster. Here's the complete benchmarking program:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <ctime>
#include <cstdlib>
using namespace std;

int RandomNumber () { return rand(); }
vector<int> find_topk(int arr[], int k, int n)
{
   make_heap(arr, arr + n, greater<int>());

   vector<int> result(k);

   for (int i = 0; i < k; ++i)
   {
      result[i] = arr[0];
      pop_heap(arr, arr + n - i, greater<int>());
   }

   return result;
}

vector<int> find_topk2(int arr[], int k, int n)
{
   make_heap(arr, arr + k, less<int>());

   for (int i = k; i < n; ++i)
   {
      if (arr[i] < arr[0])
      {
     pop_heap(arr, arr + k, less<int>());
     arr[k - 1] = arr[i];
     push_heap(arr, arr + k, less<int>());
      }
   }

   vector<int> result(arr, arr + k);

   return result;
}


int main()
{
   const int n = 220000000;
   const int k = 300;

   srand (time(0));
   int* arr = new int[n];

   generate(arr, arr + n, RandomNumber);

   // replace with topk or topk2
   vector<int> result = find_topk2(arr, k, n);

   copy(result.begin(), result.end(), ostream_iterator<int>(cout, "\n"));


   return 0;
}

find_topk 's approach is to build a complete heap of size n, in O(n) and then remove the top element of heap k times O(log n). find_topk2 's approach is to build a heap of size k (O(k)) such that max element is at the top, and then from k to n, compare to see if any element is smaller than top element, and if so pop the top element, and push the new element which would mean n times O(log k). Both approach are written quite similarly so I don't believe any implementation detail (like creating temporaries etc.) can cause a difference besides the algo and the dataset (which is random).

I could actually profile the results of the benchmark and could see that find_topk actually called the comparison operator many more times than find_topk2. But I'm interested more in the reasoning of the theoretical complexity.. so two questions.

Disregarding the implementation or benchmark, was I wrong in expecting that O(n + k log n) should be better than O(n log k)? If I'm wrong, please explain why and how to reason such that I can see O(n log k) is actually better.
If I'm not wrong to expect no 1. Then why is my benchmark showing otherwise?

I hope you are aware of the nth_element/partial_sort algorithm and doing this only for fun. — Karoly Horvath, Jul 12 '11 at 19:31
`partial_sort`/`partial_sort_copy` have exactly the same running time as your `find_topk2` on this syntetic test — sehe, Jul 12 '11 at 19:40
yes I'm aware that find_topk2 is replacable by partial_sort. But I'm doing that way to make the algorithm even more explicit. — ryaner, Jul 12 '11 at 19:45

score 4 · Accepted Answer · answered Jul 12 '11 at 19:45

Big O in several variables is complex, since you need assumptions on how your variables scale with one another, so you can take unambiguously the limit to infinity.

If eg. k ~ n^(1/2), then O(n log k) becomes O(n log n) and O(n + k log n) becomes O(n + n^(1/2) log n) = O(n), which is better.

If k ~ log n, then O(n log k) = O(n log log n) and O(n + k log n) = O(n), which is better. Note that log log 2^1024 = 10, so the constants hidden in the O(n) may be greater than log log n for any realistic n.

If k = constant, then O(n log k) = O(n) and O(n + k log n) = O(n), which is the same.

But the constants play a big role: for instance, building a heap may involve reading the array 3 times, whereas building a priority queue of length k as you go only requires one pass through the array, and a small constant times log k for the lookup.

Which is "better" is therefore unclear, although my quick analysis tended to show that O(n + k log n) performs better under mild assumptions on k.

For instance, if k is a very small constant (say k = 3), then I'm ready to bet that the make_heap approach performs worse than the priority queue one on real world data.

Use asymptotic analysis wisely, and above all, profile your code before drawing conclusions.

score 1 · Answer 2 · answered Jul 12 '11 at 19:50

You are comparing two worst case upper bounds. For the first approach, the worst case is pretty much equal to the average case. For the second case, if the input is random, by the time you have passed more than a handful of items into the heap, the chance of throwing away the new value at once because it is not going to replace any of the top K is pretty high, so the worst case estimate for this is pessimistic.

If you are comparing wall clock time as opposed to comparisons you may find that heap based algorithms with large heaps tend not to win many races because they have horrible storage locality - and constant factors on modern microprocessors are heavily influenced by what level of memory you end up working in - finding your data is out in real memory chips (or worse, on disk) and not some level of cache will slow you down a lot - which is a shame because I really like heapsort.

score 0 · Answer 3 · answered Oct 09 '15 at 02:17

Keep in mind that you can now use std::nth_element instead of having to use a heap and do things yourself. Since the default comparator operator is std::less<>(), you can say something like this:

std::nth_element(myList.begin(), myList.begin() + k, myList.end());

Now, myList from positions 0 to k will be the smallest k elements.

Top K smallest selection algorithm - O (n + k log n) vs O (n log k) for k << N

3 Answers3