Return the top K elements from an input array

Question

I am looking for an efficient way to return the top k elements from an input array.
One way would be to sort the array and return the last k elements of the sorted array.
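
For reference, the sorting approach could be sketched roughly as follows in Java. The class and method names (`SortTopK`, `topK`) are made up for illustration, and the sketch assumes an `int[]` input with 0 <= k <= arr.length. It works on a copy so the input is left untouched, and costs O(n log n) for the sort.

```java
import java.util.Arrays;

// Hypothetical helper, not from any library: baseline "sort, then take the tail".
final class SortTopK {

    /** Returns the k largest values of arr in ascending order (assumes 0 <= k <= arr.length). */
    static int[] topK(int[] arr, int k) {
        int[] sorted = arr.clone();   // copy so the caller's array is not reordered
        Arrays.sort(sorted);          // O(n log n)
        // The k largest values are the last k entries of the sorted copy.
        return Arrays.copyOfRange(sorted, sorted.length - k, sorted.length);
    }
}
```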

There are other methods suggested here, one of which uses the quickselect algorithm, but from my understanding, quickselect returns only the k-th element in an unsorted array. After it returns, elements to left and right of k are still unsorted.

So it should be something like this:

while (k > 0) {
   quickselect(arr, k);   // one O(n) selection per iteration
   k--;
}

Quickselect is O(n) and we do that k times, so the overall time complexity is O(n*k).
But the data in the post suggest that this is better than O(n log n).
Selecting the top 200 from a million samples would mean about 200 million operations in the former case (n*k) but only about 20 million in the latter (n log n, since log2 of a million is roughly 20). Clearly the latter is far better.

Is my understanding of how quickselect can be employed to select the top 200 elements correct?

Have you tried a NavigableSet or TreeSet? With a TreeSet all values will be sorted and you can use subSet(start, end) to get the top K. If you need descending order, use a NavigableMap and return the top K in descending order... all this if you are using Java. – Mark Giaconia, Feb 20 '14 at 00:16
@markg: Adding to a TreeSet and returning the top k elements will still be `n log n`, which is equivalent to sorting. – brain storm, Feb 20 '14 at 00:18

Bernhard Barker · Accepted Answer · 2014-02-20T00:59:40.910

score 4

No, you don't need O(nk) time - it can be done in O(n) (average case).

The end result of the quickselect would be the k-th element at the k-th position from the end in the array, with smaller elements on the left, and larger elements on the right.

So all elements from that element to the right would be the k largest elements - it would be trivial to extract these in O(k) (with a simple for-loop), which would end up with a total running time of O(n).

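Alternatively, since you'd know the k-th largest element after running quickselect, you could just go through the array again and extract all elements larger than or equal to that element (this could be done in a single pass - also O(n)). If that value occurs more than once, this can yield more than k elements, so take all strictly larger elements and only as many equal ones as needed to reach k.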

You'd need an additional O(k log k) (to sort them) if you want to return them in sorted order.
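
The idea above (one quickselect, then copy the tail) could be sketched like this in Java. This is only an illustration under some assumptions: the names `QuickselectTopK` and `topK` are made up, the input is an `int[]`, the method works on a copy rather than in place, and it uses a random-pivot Lomuto partition, which is just one of several ways to implement quickselect. The running time is O(n) on average, not in the worst case.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not from any library: quickselect, then copy the tail.
final class QuickselectTopK {

    /** Returns the k largest values of arr in no particular order. */
    static int[] topK(int[] arr, int k) {
        if (k < 0 || k > arr.length) {
            throw new IllegalArgumentException("k must be between 0 and arr.length");
        }
        if (k == 0) {
            return new int[0];
        }
        int[] a = arr.clone();            // work on a copy; an in-place variant would skip this
        int target = a.length - k;        // sorted position of the k-th largest element
        int lo = 0;
        int hi = a.length - 1;
        // Narrow [lo, hi] until the element at index target is in its sorted position.
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == target) {
                break;
            } else if (p < target) {
                lo = p + 1;
            } else {
                hi = p - 1;
            }
        }
        // Everything from target to the end is >= a[target]: these are the k largest.
        return Arrays.copyOfRange(a, target, a.length);   // O(k)
    }

    // Lomuto partition around a random pivot; returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, r, hi);
        int pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

If the result is needed in sorted order, calling `Arrays.sort` on the returned array adds the O(k log k) mentioned above.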

edited Feb 20 '14 at 00:59

answered Feb 20 '14 at 00:20

Bernhard Barker


Sorry, quickselect returns the k-th smallest number in an unsorted array, which means it returns the number that is at index k-1 (0-based) in the sorted array. So k need not be the middle of the array, correct? – brain storm Feb 20 '14 at 00:23
@user1988876: The quickselect algorithm will place the k-th element at its final sorted position. Furthermore, it will place all the elements smaller than the k-th to the left of it and all the elements larger than the k-th to the right of it. – Niklas B. Feb 20 '14 at 00:25
And this method would modify the input array, correct? – brain storm Feb 20 '14 at 00:26
@user1988876 "middle" might be a bit ambiguous, I modified my answer. This could either be implemented in-place (by modifying the input array) or by using extra storage (an additional array, or lists for the left and right parts). – Bernhard Barker Feb 20 '14 at 00:28
@Dukeling: If I want the top k elements, I would extract the elements greater than or equal to the k-th element, which gives the other k-1 elements. I mean, for example, that in a sorted list I want the last k elements. That is why it is top K, correct? – brain storm Feb 20 '14 at 00:35
@user1988876 Yes, top k presumably means largest k. Edited my answer. – Bernhard Barker Feb 20 '14 at 00:44
So the time complexity is O(n) for quickselect, plus returning the elements from (k to input.array.length), which would be another O(n). That is O(n)+O(n), resulting in O(n) overall. Am I right? – brain storm Feb 20 '14 at 00:58
@user1988876 Yes, the final running time would be O(n). – Bernhard Barker Feb 20 '14 at 01:01

score 2 · Answer 2 · answered Feb 20 '14 at 01:03

Quickselect is good if n is not too large. If you have a very large n (too large to fit in memory) or an unknown n (say you are seeing an unbounded stream of samples and want to be able to report the 200 largest seen so far at any point), an alternative is to keep a k-element min-heap. Every time you see a new element, compare it to the top of the min-heap (the smallest of the 200 largest elements so far); if it is larger, pop the old top of the heap and push the new element onto the heap.
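
A minimal sketch of that heap approach in Java, assuming integer samples and using `java.util.PriorityQueue` (a min-heap under natural ordering). The class name `StreamingTopK` and its methods are made up for illustration. Each new element costs O(log k), so n elements take O(n log k) time with O(k) memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not from any library: keeps the k largest values seen so far.
final class StreamingTopK {
    private final int k;
    // Min-heap: peek() is the smallest of the k largest values seen so far.
    private final PriorityQueue<Integer> heap = new PriorityQueue<>();

    StreamingTopK(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Feeds one sample from the stream. */
    void offer(int value) {
        if (heap.size() < k) {
            heap.add(value);              // still filling up the first k slots
        } else if (value > heap.peek()) {
            heap.poll();                  // drop the current smallest of the top k
            heap.add(value);
        }
    }

    /** Returns the largest values seen so far (at most k), in descending order. */
    List<Integer> largestSoFar() {
        List<Integer> out = new ArrayList<>(heap);
        out.sort(Comparator.reverseOrder());
        return out;
    }
}
```

Usage would be along the lines of creating `new StreamingTopK(200)`, calling `offer(x)` for each incoming sample, and calling `largestSoFar()` whenever a report is needed.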
