
Say you have 4 sorted sets with thousands and thousands of keys and scores. Since they are sorted sets, getting the top items can be done in logarithmic time.

The easy way would be to take the union of the sets and then get the top items, but doing so is at least linear in the total number of items across all sets.

The best way I could think of is this:

  1. Take the top N items from every set
  2. Find the item with the lowest rank and the highest score for that rank.
  3. Divide that score by the number of sets. (Any key with a score lower than this can never be in the top N.)
  4. Take the union of those keys. (Ignoring scores)
  5. Find the scores for all keys in all sets. (A key might have score 1 in one set and 10000 in another)

In other words: find all keys that could possibly be in the top list, and take the union over just those keys. There are probably more efficient ways to limit the number of candidates.

[edit] Keys occur in one or more sets, and the sum of a key's scores across those sets determines its final score. So a key that is in all sets with a low score might end up with a higher total than a key with a high score that is in only one set.

Pepijn
  • are all keys unique? – Vikram Bhat Jun 10 '14 at 10:54
  • An optimization of the naive method would be to take the top `s` items from each set, and just union them. That is, if you want the top 10 items, take the top 10 from each set, union, sort, and take the top 10 from those 40. If you know that `s` is very small in comparison to the total number of items that simple optimization could very well be all you need to make this perform well. Otherwise, use the simple merge with a priority queue. – Jim Mischel Jun 10 '14 at 18:30
  • No the keys are not unique. A key might be in multiple lists, and the sum of its scores determines the final score. So taking the top s items doesn't cut it. – Pepijn Jun 12 '14 at 19:42
  • The only reasonable way to do this is with a merge to sum the scores, feeding into a priority queue for selection. The merge is `O(n log 4)`, and the selection is `O(m log k)`, where `m` is the number of unique items and `k` is the number of items you want to select. This requires O(k) extra space. You could do the merge and then use the Quickselect algorithm (which is O(n)) to do the selection, but that will require O(m) extra space. Also, heap select outperforms Quickselect when `k` is a small percentage (1 to 2 percent) of `m`. – Jim Mischel Jun 12 '14 at 22:36

1 Answer


The algorithm you propose seems quite awkward. Just take one of the following:

The simple way

for i = 1 to n
    look at the smallest element of each set,
    pick the overall smallest and remove it from its set

Complexity: O(n * s) where n is the number of items you want and s is the number of sets.

Of course, if you are not allowed to remove elements from the sets, you can also maintain iterators into each set to get elements from them in sorted order without having to alter the sets.

A more efficient way

Maintain a priority queue over all the smallest elements of each set. Whenever removing the smallest element e from that priority queue, reinsert the next element from the set from which e came.

Complexity: Assume a simple priority queue with O(log n) 'insert' and O(log n) 'remove smallest element' complexity. There are better ones, like Fibonacci heaps, but this one will do just fine. Then we have:

  • s insertions to fill the priority queue at the start, so O(s log s).
  • n "delete smallest element" + insert a new one, so O(n log s) (since there are always s elements in the queue)

Thus, we achieve O(s log s + n log s), which is way better.

Comparison

As long as s is quite small, there shouldn't really be a big difference between the algorithms and you can also pick the simple one. If you have a lot of sets, then you should definitely go for the second approach.

Lookup Complexity

In my analysis, I omitted the logarithmic lookup factor needed to find the smallest element of each set and assumed it could be retrieved in O(1), as in a sorted list. Varying the lookup cost from O(1) to O(log n) only introduces an additional factor and does not alter the algorithms. In addition, you usually pay the O(log n) only once, at the first lookup; afterwards you typically have an iterator to the smallest element, and accessing each further element through it is only O(1).

gexicide
  • Instead of removing the elements (which is usually a complex task), we could just advance a pointer in the list. Btw, this is a variation of merge-sort. – Nico Schertler Jun 10 '14 at 10:33
  • @NicoSchertler: Right, but it is inefficient, check my solution below – gexicide Jun 10 '14 at 10:34
  • Well, it's questionable if a priority queue with four entries is really faster than hard-coding the comparisons. – Nico Schertler Jun 10 '14 at 10:36
  • @NicoSchertler: Righto ;) But surely better asymptotics. – gexicide Jun 10 '14 at 10:37
  • My solution might seem awkward, but yours seems to do a different thing entirely. Yours takes the minimum rather than the maximum, and does not account for keys that occur in multiple sets. – Pepijn Jun 12 '14 at 19:50