3

I was asked this question in a recent Java interview.

Given a List containing millions of items, maintain a list of the highest n items. Sorting the list in descending order then taking the first n items is definitely not efficient due to the list size.

Below is what I did. I'd appreciate it if anyone could provide a more efficient or elegant solution, as I believe this could also be solved using a PriorityQueue:

public TreeSet<Integer> findTopNNumbersInLargeList(final List<Integer> largeNumbersList, 
final int highestValCount) {

    TreeSet<Integer> highestNNumbers = new TreeSet<Integer>();

    for (int number : largeNumbersList) {
        if (highestNNumbers.size() < highestValCount) {
            highestNNumbers.add(number);
        } else {
            for (int i : highestNNumbers) {
                if (i < number) {
                    highestNNumbers.remove(i);
                    highestNNumbers.add(number);
                    break;
                }
            }
        }
    }
    return highestNNumbers;
}
Luiggi Mendoza
gsdev
  • It would be more efficient to implement a simple, bounded, ordered, circular buffer - maybe based on a `TreeSet`. Then simply use that to maintain the top X. – Boris the Spider Dec 30 '14 at 16:45
  • 2
    what if you have duplicate elements? – SMA Dec 30 '14 at 16:49
  • first i would ask how the list is contained is it random or what, if it has some particular order to it you can use that to slice the list and what not to fit what they want. for loops nested aren't efficient I think since you may then have `1,000,000^3` – jgr208 Dec 30 '14 at 16:52

4 Answers

6

The for loop at the bottom is unnecessary, because you can tell right away if the number should be kept or not.

TreeSet lets you find the smallest element in O(log N)*. Compare that smallest element to number. If the number is greater, add it to the set, and remove the smallest element. Otherwise, keep walking to the next element of largeNumbersList.

The worst case is when the original list is sorted in ascending order, because you would have to replace an element in the TreeSet at each step. In this case the algorithm would take O(K log N), where K is the number of items in the original list and N is the number of top items to keep. Sorting the whole list costs O(K log K), so this is an improvement by a factor of log_N K (that is, log K / log N).

Note: If your list consists of Integers, you could use a linear sorting algorithm that is not based on comparisons to get the overall asymptotic complexity to O(K). This does not mean that the linear solution would necessarily be faster than the original for any fixed number of elements.

* You can maintain the value of the smallest element as you go to make it O(1).
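The approach described above could be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the class name `TopNWithTreeSet` and the method name `topN` are made up for this example, and because a `TreeSet` discards duplicates, the result holds the N largest distinct values.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper class, named only for this sketch.
final class TopNWithTreeSet {

    // Returns the n largest distinct values of the input.
    static TreeSet<Integer> topN(List<Integer> numbers, int n) {
        TreeSet<Integer> top = new TreeSet<>();
        if (n <= 0) {
            return top;
        }
        int smallest = 0; // cached minimum; only meaningful once the set holds n values
        for (int number : numbers) {
            if (top.size() < n) {
                // Still filling up: every new value is kept.
                top.add(number);
                if (top.size() == n) {
                    smallest = top.first();
                }
            } else if (number > smallest) {
                // O(1) check against the cached minimum; only now touch the tree.
                // add() returns false if the value is already present.
                if (top.add(number)) {
                    top.pollFirst();        // drop the old minimum
                    smallest = top.first(); // refresh the cache
                }
            }
        }
        return top;
    }
}
```

For example, `TopNWithTreeSet.topN(Arrays.asList(5, 1, 9, 3, 7, 9, 2), 3)` yields a set containing 5, 7 and 9. Values that do not beat the cached minimum are rejected without any tree operation, which is where the O(1) check in the footnote comes from.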

Sergey Kalinichenko
  • yes this is the best answer I think anyone could give, honestly in this situation I think the person listening to your answer should care more about the logic and algorithm as well as time complexity than the syntax and such. – jgr208 Dec 30 '14 at 16:57
  • Any chance you could elaborate on your O(1) solution? – gsdev Dec 30 '14 at 17:30
  • Radix sorting and friends aren't necessary for O(1) per element; see my answer. – Louis Wasserman Dec 30 '14 at 17:46
  • @gsdev The smallest element among top `N` can only go up when you insert a new element into the tree set (an `O(Log N)` operation). If you get the smallest element from the top N at the same time (a second `O(Log N)` operation) you can store the value in a separate variable, so you can get it for `O(1)` on subsequent iterations. Note that this optimization does not change the asymptotic complexity in the worst case, but the improvement could be considerable with some favorable ordering. For example, if the original list is in descending order, the last K-N checks complete in O(1). – Sergey Kalinichenko Dec 30 '14 at 17:59
4

You don't need nested loops. Just keep inserting, and remove the smallest number whenever the set grows too large:

public Set<Integer> findTopNNumbersInLargeList(final List<Integer> largeNumbersList, 
  final int highestValCount) {

  TreeSet<Integer> highestNNumbers = new TreeSet<Integer>();

  for (int number : largeNumbersList) {
    highestNNumbers.add(number);
    if (highestNNumbers.size() > highestValCount) {
      highestNNumbers.pollFirst();
    }
  }
  return highestNNumbers;
}

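The same approach should work with a PriorityQueue, too (calling poll() instead of pollFirst(); note that a PriorityQueue keeps duplicates, whereas a TreeSet drops them). The runtime should be O(n log highestValCount) in any case, where n is the size of the input list.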

P.S. As pointed out in the other answer, you can optimize this some more (at the cost of readability) by keeping track of the lowest number, avoiding unnecessary inserts.
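As a rough sketch of the PriorityQueue variant combined with the "skip unnecessary inserts" optimization, something like the following could work. The class name `TopNWithHeap` and the method name `topN` are invented for this example. Unlike the TreeSet version, this one keeps duplicate values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper class, named only for this sketch.
final class TopNWithHeap {

    // Returns the n largest values (duplicates included), largest first.
    static List<Integer> topN(List<Integer> numbers, int n) {
        List<Integer> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // PriorityQueue is a min-heap by default, so peek() is the smallest kept value.
        PriorityQueue<Integer> heap = new PriorityQueue<>(n);
        for (int number : numbers) {
            if (heap.size() < n) {
                heap.offer(number);
            } else if (number > heap.peek()) {
                // Only values that beat the current minimum cost O(log n).
                heap.poll();
                heap.offer(number);
            }
        }
        result.addAll(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

With the input 5, 1, 9, 3, 7, 9, 2 and n = 3, this returns 9, 9, 7. Since `peek()` is O(1), values that are too small are rejected without modifying the heap.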

Kick Buttowski
Stefan Haustein
4

It's possible to support amortized O(1) processing of new elements and O(n) querying of the current top elements as follows:

Maintain a buffer of size 2n, and whenever you see a new element, add it to the buffer. When the buffer gets full, use quick select or another linear median finding algorithm to select the current top n elements, and discard the rest. This is an O(n) operation, but you only need to perform it every n elements, which balances out to O(1) amortized time.

This is the algorithm Guava uses for Ordering.leastOf, which extracts the top n elements from an Iterator or Iterable. It is fast enough in practice to be quite competitive with a PriorityQueue based approach, and it is much more resistant to worst case input.
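To make the buffer idea concrete, here is a simplified sketch of it. This is not Guava's actual code: the class `TopNBufferSelect` and its methods are invented for illustration, it handles only `int` values, and it uses a basic randomized quickselect that runs in expected linear time on mostly distinct values (inputs with many duplicates would call for a three-way partition).

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical class, named only for this sketch.
final class TopNBufferSelect {
    private final int n;
    private final int[] buffer; // capacity 2n
    private int size = 0;

    TopNBufferSelect(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
        this.buffer = new int[2 * n];
    }

    // Append the value; trim back to n values only when the buffer is full.
    void add(int value) {
        buffer[size++] = value;
        if (size == buffer.length) {
            trim();
        }
    }

    // Returns the current top values (at most n) in no particular order.
    int[] top() {
        if (size > n) {
            trim();
        }
        return Arrays.copyOf(buffer, size);
    }

    // Quickselect: moves the n largest values into buffer[0..n) and drops the rest.
    private void trim() {
        int lo = 0;
        int hi = size - 1;
        while (lo < hi) {
            int p = partition(lo, hi);
            if (p == n - 1) {
                break;
            } else if (p < n - 1) {
                lo = p + 1;
            } else {
                hi = p - 1;
            }
        }
        size = n;
    }

    // Partitions buffer[lo..hi] around a random pivot, larger values first,
    // and returns the pivot's final index.
    private int partition(int lo, int hi) {
        swap(ThreadLocalRandom.current().nextInt(lo, hi + 1), hi);
        int pivot = buffer[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (buffer[i] > pivot) {
                swap(i, store++);
            }
        }
        swap(store, hi);
        return store;
    }

    private void swap(int i, int j) {
        int tmp = buffer[i];
        buffer[i] = buffer[j];
        buffer[j] = tmp;
    }
}
```

Feeding 5, 1, 9, 3, 7, 9, 2, 8, 4 into a `TopNBufferSelect(3)` leaves 9, 9 and 8 in the buffer. Each trim touches about 2n elements but happens only once per n additions, which is what gives the amortized O(1) cost per element.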

Louis Wasserman
0

I would start by saying that your question, as stated, is impossible. There is no way to find the highest n items in a List without fully traversing it. And there is no way to fully traverse an infinite List.

That said, the text of your question differs from the title. There is a massive difference between very large and infinite. Please bear that in mind.

To answer the feasible question, I would begin by implementing a buffer class to encapsulate the behaviour of keeping the top N. Let's call it TopNBuffer:

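import java.util.Collections;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

class TopNBuffer<T extends Comparable<T>> {
    // Sorted ascending, so the first element is always the lowest one kept.
    private final NavigableSet<T> backingSet = new TreeSet<>();

    private final int limit;

    public TopNBuffer(int limit) {
        this.limit = limit;
    }

    public void add(final T t) {
        // add() returns true only if t was not already in the set.
        if (backingSet.add(t) && backingSet.size() > limit) {
            // Over the limit: evict the lowest element.
            backingSet.pollFirst();
        }
    }

    // Read-only view of the current highest elements.
    public SortedSet<T> highest() {
        return Collections.unmodifiableSortedSet(backingSet);
    }
}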

All we do here, on add, is this: if the number was not already in the Set, and adding it makes the Set exceed its limit, we simply remove the lowest element from the Set.

The method highest gives an unmodifiable view of the current highest elements. So, in Java 8 syntax, all you need to do is:

final TopNBuffer<Integer> topN = new TopNBuffer<>(n);
largeNumbersList.forEach(topN::add);
final Set<Integer> highestN = topN.highest();

I think in an interview environment, it's not enough to simply whack lots of code into a method. Demonstrating an understanding of OO programming and separation of concerns is also important.

Boris the Spider
  • Given a List containing millions of items, maintain a list of the highest n items. Sorting the list in descending order then taking the first n items is definitely not efficient due to the list size. – Kick Buttowski Dec 30 '14 at 17:04
  • @Kick That's exactly what this code does. Except it's a `Set`, so adding to the highest n is `O(lg n)` rather than `O(n)` as it would be with a `List`. Please read and understand the code before commenting. – Boris the Spider Dec 30 '14 at 17:05
  • Hi Kick, the question was asked during a Java interview I had. The task is to maintain the top n numbers added to a very large list. – gsdev Dec 30 '14 at 17:35
  • @gsdev it looks like an awesome question. I am preparing for Google too, but the question is not that clear to me. can you explain better? so what I understood is sort the list without using sorting method? when u remove the smallest number where they go? can u explain in detail plz? – Kick Buttowski Dec 30 '14 at 17:40
  • 1
    @Kick, The interviewer asked me to maintain the top n items in a list containing millions of entries. They pointed out that due to the size of the list, sorting the list and then taking the first n items was not a valid answer. They wanted me to solve the problem in the most efficient way. – gsdev Dec 30 '14 at 18:40