heapq.nlargest is always the correct answer when the question is "How do I get a small number of maximum values from a huge set of inputs?" Because it uses heaps, it keeps both memory and CPU usage lower than just about anything else you could do in Python. Example:
import heapq
from operator import itemgetter

n = 3
items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}

# The n highest-valued (key, value) pairs, highest value first
# (use items.iteritems() on Python 2):
topitems = heapq.nlargest(n, items.items(), key=itemgetter(1))
topitemsasdict = dict(topitems)  # {'e': 24, 'g': 24, 'b': 12}
sorted and slicing the result can win when the number of max items requested is a large percentage of the input, but for huge inputs and a small number of max items, the memory savings of heapq.nlargest will win.
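For reference, the sorted-and-slice version of the same example looks like this (just an illustration mirroring the variable names above):

from operator import itemgetter

items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}
# Fully sort all items descending by value, then discard everything past
# the first n; the sort has to hold and order every single item.
topitems = sorted(items.items(), key=itemgetter(1), reverse=True)[:3]
topitemsasdict = dict(topitems)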
For the CS theory geeks: for an input of size n, selecting the k max values, heapq.nlargest requires O(n log k) computation and O(k) storage, while sorted followed by slicing requires O(n log n) computation and O(n) storage. So for 1024 inputs and 4 selected items, the work for nlargest is roughly 1024 * 2 comparisons (log2(4) = 2) with storage for 4 items; sorted + slicing is roughly 1024 * 10 comparisons (log2(1024) = 10) with storage for all 1024 items.

In practice, the TimSort used by sorted has lower constant-factor overhead than big-O notation can properly convey, and usually performs better than the asymptotics alone would suggest; that's why, for, say, selecting the top 200 items out of 1024, sorted + slicing can still win. But nlargest has no pathological degradation for huge inputs and outputs: it may occasionally be slower, but not by much, whereas sorted can be faster, yet it can also be much slower.
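If you want to see where the O(n log k) bound comes from, here is a simplified sketch of the bounded-heap technique that nlargest is built on (it leaves out the key= handling and tie-breaking of the real implementation; nlargest_sketch is just an illustrative name):

import heapq
import random
from itertools import islice

def nlargest_sketch(k, iterable):
    # Keep the k largest values seen so far in a min-heap of size k,
    # so the smallest of the current top-k sits at the root.
    it = iter(iterable)
    heap = list(islice(it, k))              # never stores more than k values
    heapq.heapify(heap)                     # O(k)
    for value in it:
        if value > heap[0]:                 # beats the weakest of the top-k?
            heapq.heapreplace(heap, value)  # O(log k) per replacement
    return sorted(heap, reverse=True)       # O(k log k) to present in order

# 1024 inputs, 4 selected: about 1024 * log2(4) comparisons, 4 slots of storage.
print(nlargest_sketch(4, (random.random() for _ in range(1024))))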