I'm asking this in regards to Top K algorithm. I'd think that O(n + k log n) should be faster, because well.. for instance if you try plugging in k = 300 and n = 100000000 for example, we can see that O(n + k log n) is smaller.
However when I do a benchmark with C++, it's showing me that O (n log k) is more than 2x faster. Here's the complete benchmarking program:
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <ctime>
#include <cstdlib>
using namespace std;
int RandomNumber () { return rand(); }
vector<int> find_topk(int arr[], int k, int n)
{
make_heap(arr, arr + n, greater<int>());
vector<int> result(k);
for (int i = 0; i < k; ++i)
{
result[i] = arr[0];
pop_heap(arr, arr + n - i, greater<int>());
}
return result;
}
vector<int> find_topk2(int arr[], int k, int n)
{
make_heap(arr, arr + k, less<int>());
for (int i = k; i < n; ++i)
{
if (arr[i] < arr[0])
{
pop_heap(arr, arr + k, less<int>());
arr[k - 1] = arr[i];
push_heap(arr, arr + k, less<int>());
}
}
vector<int> result(arr, arr + k);
return result;
}
int main()
{
const int n = 220000000;
const int k = 300;
srand (time(0));
int* arr = new int[n];
generate(arr, arr + n, RandomNumber);
// replace with topk or topk2
vector<int> result = find_topk2(arr, k, n);
copy(result.begin(), result.end(), ostream_iterator<int>(cout, "\n"));
return 0;
}
find_topk 's approach is to build a complete heap of size n, in O(n) and then remove the top element of heap k times O(log n). find_topk2 's approach is to build a heap of size k (O(k)) such that max element is at the top, and then from k to n, compare to see if any element is smaller than top element, and if so pop the top element, and push the new element which would mean n times O(log k). Both approach are written quite similarly so I don't believe any implementation detail (like creating temporaries etc.) can cause a difference besides the algo and the dataset (which is random).
I could actually profile the results of the benchmark and could see that find_topk actually called the comparison operator many more times than find_topk2. But I'm interested more in the reasoning of the theoretical complexity.. so two questions.
- Disregarding the implementation or benchmark, was I wrong in expecting that O(n + k log n) should be better than O(n log k)? If I'm wrong, please explain why and how to reason such that I can see O(n log k) is actually better.
- If I'm not wrong to expect no 1. Then why is my benchmark showing otherwise?