Why is using a key function so much slower?

Question

There is a drastic performance hit when using a keyfunc in heapq.nlargest:

>>> from random import random
>>> from heapq import nlargest
>>> data = [random() for _ in range(1234567)]
>>> %timeit nlargest(10, data)
30.2 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit nlargest(10, data, key=lambda n: n)
159 ms ± 6.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I expected a small extra cost, perhaps something like 30% - not 400%. This degradation seems to be reproducible over a few different data sizes. You can see in the source code there is a special-case handling for if key is None, but otherwise the implementation looks more or less the same.

Why is performance so degraded by using a key function? Is it only due to the extra function call overhead, or is the algorithm fundamentally changed somehow by using a keyfunc?

For comparison, sorted takes about a 30% hit with the same data and lambda.

Trying to run this with line_profiler suggests that it really should be about a 30% hit (the iteration, lambda, and comparison each take up about 30% of the time). So, I guess in this case the extra overhead of running the profiler is changing things. — user545424, Jun 06 '18 at 19:03

user2357112 · Answer 1 · 2018-06-06T19:49:30.600

The extra overhead of calling lambda n: n so many times is really just that expensive.

In [17]: key = lambda n: n

In [18]: x = [random() for _ in range(1234567)]

In [19]: %timeit nlargest(10, x)
33.1 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [20]: %timeit nlargest(10, x, key=key)
133 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [21]: %%timeit
    ...: for i in x:
    ...:     key(i)
    ...: 
93.2 ms ± 978 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [22]: %%timeit
    ...: for i in x:
    ...:     pass
    ...: 
10.1 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can see, the cost of calling key on all the elements accounts for almost the entirety of the overhead.

Key evaluations are equally expensive for sorted, but because the total work of sorting is more expensive, the overhead of key calls is a smaller percentage of the total. You should have compared the absolute overhead of using a key with nlargest or sorted, rather than the overhead as a percentage of the base.

In [23]: %timeit sorted(x)
542 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %timeit sorted(x, key=key)
683 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, the cost of key calls accounts for about half the overhead of using this key with sorted on this input, the rest of the overhead probably coming from the work of shuffling more data around in the sort itself.

You might wonder how nlargest manages to do so little work per element. For the no-key case, most iteration happens in the following loop:

for elem in it:
    if top < elem:
        _heapreplace(result, (elem, order))
        top = result[0][0]
        order -= 1

or for the case with a key:

for elem in it:
    k = key(elem)
    if top < k:
        _heapreplace(result, (k, order, elem))
        top = result[0][0]
        order -= 1

The crucial realization is that the top < elem and top < k branches are almost never taken. Once the algorithm has found 10 fairly large elements, most of the remaining elements are going to be smaller than the 10 current candidates. On the rare occasions where a heap element needs to be replaced, that just makes it even harder for further elements to pass the bar needed to call heapreplace.

On a random input, the number of heapreplace calls nlargest makes is expected logarithmic in the size of the input. Specifically, for nlargest(10, x), aside from the first 10 elements of x, element x[i] has a 10/(i+1) probability of being in the top 10 elements of l[:i+1], which is the condition necessary for a heapreplace call. By linearity of expectation, the expected number of heapreplace calls is the sum of these probabilities, and that sum is O(log(len(x))). (This analysis holds with 10 replaced by any constant, but a slightly more sophisticated analysis is needed for a variable n in nlargest(n, l).)

The performance story would be very different for a sorted input, where every element would pass the if check:

In [25]: sorted_x = sorted(x)

In [26]: %timeit nlargest(10, sorted_x)
463 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Over 10 times as expensive as the unsorted case!

Looks right. I should have clicked that `nlargest(1234567, data)` is essentially a sort (with reverse=True), and tried that. Now I see `max` and `min` with keyfunc take the same huge hit. — wim, Jun 06 '18 at 19:04

score 8 · Accepted Answer · answered Jun 06 '18 at 19:09

Say your iterable has N elements. Whether sorting or doing nlargest, the key function will be called N times. When sorting, that overhead is largely buried under roughly N * log2(N) other operations. But when doing nlargest of k items, there are only roughly N * log2(k) other operations, which is much smaller when k is much smaller than N.

In your example, N = 1234567 and k = 10, and so the ratio of other operations, sorting over nlargest, is roughly:

>>> log2(1234567) / log2(10)
6.0915146640862625

That this is close to 6 is purely coincidence ;-) It's the qualitative point that matters: the overhead of using a key function is much more significant for nlargest than for sorting randomly ordered data, provided k is much smaller than N.

In fact, that greatly understates the relative burden for nlargest, because the O(log2(k)) heapreplace is called in the latter only when the next element is larger than the k'th largest seen so far. Most of the time it isn't, and so the loop on such an iteration is nearly pure overhead, calling a Python-level key function just to discover that the result isn't interesting.

Quantifying that is beyond me, though; for example, on my Win10 box under Python 3.6.5, I only see a timing difference in your code a bit less than a factor of 3. That doesn't surprise me - calling a Python-level function is much more expensive than poking a list iterator and doing an integer compare (both "at C speed").

@wim: No, the iterator is advanced by the list comprehension, so the subsequent loop doesn't re-consider those elements. — user2357112, Jun 06 '18 at 19:17
Nope. It's called exactly once per input element. In `for elem in it:`, the iterator (`it`) has already advanced beyond the first `k` elements thanks to the earlier `zip`. — Tim Peters, Jun 06 '18 at 19:18

Why is using a key function so much slower?

2 Answers2