Consider the task of finding the top-k elements in a set of N independent and identically distributed floating point values. By using a priority queue / heap, we can iterate once over all N elements and maintain a top-k set by the following operations:

  • if the element x is "worse" than the heap's head: discard x ⇒ complexity O(1)

  • if the element x is "better" than the heap's head: remove the head and insert x ⇒ complexity O(log k)

The worst-case time complexity of this approach is obviously O(N log k), but what about the average time complexity? Due to the iid assumption, the probability of the O(1) operation increases as more elements are processed, so we rarely have to perform the costly O(log k) operation, especially for k << N.
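For concreteness, here is a minimal Python sketch of the scan described above. It assumes "better" means "larger" and uses a min-heap whose head is the smallest of the current top-k. The function name `top_k_with_count` and the update counter are only illustrative; they are not part of the original question.

```python
import heapq


def top_k_with_count(values, k):
    """Return (top-k values in descending order, number of heap modifications).

    Assumption: "better" means larger, so heap[0] (the min-heap's head)
    is the worst element of the current top-k set.
    """
    if k <= 0:
        return [], 0
    heap = []
    updates = 0
    for x in values:
        if len(heap) < k:
            # Heap not full yet: always insert, O(log k).
            heapq.heappush(heap, x)
            updates += 1
        elif x > heap[0]:
            # x beats the head: remove the head and insert x, O(log k).
            heapq.heapreplace(heap, x)
            updates += 1
        # Otherwise x is discarded after one O(1) comparison.
    return sorted(heap, reverse=True), updates
```

On sorted ascending input every element triggers a heap modification (the worst case), while on sorted descending input only the first k do.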

What is the average time complexity, and is it documented in any citable reference? If you have a citable reference for your answer, please include it.

bluenote10
  • IMO for k << N, the complexity will asymptotically approach O(N). – Abhishek Bansal Dec 20 '13 at 15:55
  • I'm fairly sure asking for a 'citable reference' classifies as a recommendation question, which is off topic for [so], as per the [help/on-topic]. Feel free to change your question appropriately. – Bernhard Barker Dec 20 '13 at 16:00
  • @Dukeling: I'm not asking for a recommendation. Should I modify the question in a way that it has a unique answer? For instance, by asking for the _first_ publication, which contains this result? To me, the question is more whether such a reference exists at all. – bluenote10 Dec 20 '13 at 16:10
  • The request for a citeable reference is not on topic for this network. It's fine to ask the question of 'how do I do this/find this/what is this', but if you're really asking for research help, it's not appropriate. – Joe Dec 20 '13 at 17:08
  • Meta discussion [here](http://meta.stackexchange.com/q/212944/212780). – Geobits Dec 20 '13 at 17:20
  • Rather than asking for a citation, why's the question not just "what's the average time complexity?" It's not that hard to figure out from first principles (eg: see my answer). – Paul Hankin Dec 21 '13 at 14:00
  • @AbhishekBansal, I wonder whether it might be O(log(n)) when k is far less than n. Actually, I think the complexity is k*log(n/k): take k = 1 and you get log(n); if k is n, you get 1. Consider splitting the n elements into groups of at most k numbers each, using as few groups as possible, so there are approximately g = n/k groups. Merging two groups to extract their top-k gives k elements in order and, apart from the first time, costs O(k). Each round turns every 2 groups into 1 group; repeating this log(g) times gives O(k*log(g)), which is O(k*log(n/k)). – http8086 Mar 31 '21 at 08:40

1 Answer

Consider the i'th largest element, and a particular permutation. It'll be inserted into the k-sized heap if no more than k-1 of the (i - 1) larger elements appear before it in the permutation.

The probability of that heap-insertion happening is 1 if i <= k, and k/i if i > k.
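If you want to sanity-check the k/i claim, a brute-force enumeration over all orderings of a small input works. The sketch below is only an illustration (the name `insertion_probabilities` is made up here); it assumes distinct values and a min-heap whose head is the smallest of the current top-k, and it is only feasible for small n because it visits all n! permutations.

```python
from fractions import Fraction
from heapq import heappush, heapreplace
from itertools import permutations


def insertion_probabilities(n, k):
    """Exact probability that the i'th largest of n distinct values enters
    the k-sized heap, for i = 1..n, by enumerating all n! orderings."""
    counts = [0] * (n + 1)  # counts[i]: orderings in which rank i is inserted
    total = 0
    for perm in permutations(range(1, n + 1)):
        total += 1
        heap = []
        for rank in perm:
            value = -rank  # rank 1 is the largest value
            if len(heap) < k:
                heappush(heap, value)
                counts[rank] += 1
            elif value > heap[0]:
                heapreplace(heap, value)
                counts[rank] += 1
    return [Fraction(counts[i], total) for i in range(1, n + 1)]


# For n = 5, k = 2 this gives 1, 1, 2/3, 1/2, 2/5, i.e. 1 for i <= k and k/i after.
print(insertion_probabilities(5, 2))
```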

From this, you can compute the expectation of the number of heap adjustments, using linearity of expectation. It's sum(i = 1 to k)1 + sum(i = k+1 to n)k/i = k + sum(i = k+1 to n)k/i = k * (1 + H(n) - H(k)), where H(n) is the n'th harmonic number.
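A quick way to convince yourself of the formula is to compare it against a simulation. This is a rough sketch under the question's assumptions (iid floats, min-heap holding the current top-k); the function names, the seed, and the trial count are arbitrary choices for illustration.

```python
import heapq
import random


def expected_heap_updates(n, k):
    """k * (1 + H(n) - H(k)), evaluated as k + sum_{i=k+1}^{n} k/i."""
    return k + sum(k / i for i in range(k + 1, n + 1))


def simulated_heap_updates(n, k, trials, seed=0):
    """Average number of heap modifications over `trials` random inputs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        heap = []
        for _ in range(n):
            x = rng.random()
            if len(heap) < k:
                heapq.heappush(heap, x)
                total += 1
            elif x > heap[0]:
                heapq.heapreplace(heap, x)
                total += 1
    return total / trials


if __name__ == "__main__":
    n, k = 1000, 10
    print("formula:   ", expected_heap_updates(n, k))
    print("simulation:", simulated_heap_updates(n, k, trials=200))
```

The two numbers should agree up to sampling noise, and both are far below n when k << n.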

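Since H(n) - H(k) is approximately ln(n/k), this is approximately k * (1 + ln(n/k)), or roughly k log(n) for k << n. You can compute your average cost from there: each of the n elements costs O(1) to compare against the heap's head, and each heap adjustment costs O(log k), so the expected total is O(n + k log(k) log(n/k)).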

einpoklum
Paul Hankin