
CPython 3.6.4:

from functools import partial

def add(x, y, z, a):
    return x + y + z + a

list_of_as = list(range(10000))

def max1():
    return max(list_of_as , key=lambda a: add(10, 20, 30, a))

def max2():
    return max(list_of_as , key=partial(add, 10, 20, 30))

now:

In [2]: %timeit max1()
4.36 ms ± 42.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit max2()
3.67 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I thought partial just remembers some of the arguments and forwards them to the original function when called with the rest (so it's nothing more than a shortcut), but it seems to do some optimization. In my case the whole max2 function is about 15% faster than max1, which is pretty nice.

It would be great to know what the optimization is, so I could use it more effectively. The docs are silent regarding any optimization. Not surprisingly, the "roughly equivalent to" implementation given in the docs does not optimize at all:

In [3]: %timeit max2()  # using `partial` implementation from docs 
10.7 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
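For reference, the pure-Python stand-in used for that timing can be sketched as below. This is a reconstruction modelled on the "roughly equivalent to" code in the functools documentation, so the exact body is an assumption; `partial_sketch` is a name chosen here to avoid shadowing the real `partial`.

```python
def partial_sketch(func, *args, **keywords):
    """Pure-Python stand-in modelled on the docs' 'roughly equivalent' code (assumed body)."""
    def newfunc(*fargs, **fkeywords):
        # Merge stored keywords with call-time keywords; call-time ones win.
        newkeywords = {**keywords, **fkeywords}
        # Stored positional arguments go first, call-time ones after.
        return func(*args, *fargs, **newkeywords)
    newfunc.func = func
    newfunc.args = args
    newfunc.keywords = keywords
    return newfunc


def add(x, y, z, a):
    return x + y + z + a


key = partial_sketch(add, 10, 20, 30)
print(key(5))                      # 10 + 20 + 30 + 5
print(max(range(10000), key=key))  # largest element, since the key grows with a
```

Every call through `newfunc` is itself a Python-level call that then makes a second Python-level call to `add`, which is consistent with it being slower than the lambda.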
MSeifert
pawelswiecki
  • Did not know about partial, thanks for bringing it to my attention. As a hazarded guess: the overhead of creating new stack frames is lower with partial involved, as it only needs to store one variable instead of all four, but that's just a hazy (and most probably wrong) guess. I am waiting for the pros to pipe in with an in-depth explanation :) – Patrick Artner Apr 22 '18 at 12:20
  • Any difference if instead of a lambda, you use a ‘def’? – Michal Charemza Apr 22 '18 at 12:22
  • @MichalCharemza I just checked (created an external `def helper(a): return add(10, 20, 30, a)` and used it in `max`) and there is no difference in speed. – pawelswiecki Apr 22 '18 at 12:27
  • Did you try partial implementation from python sources? (https://github.com/python/cpython/blob/master/Lib/functools.py#L234) – awesoon Apr 22 '18 at 12:42
  • @soon I just did (by simply copying `class partial` into my code) and it's even worse than the "roughly equivalent to" one: `17.7 ms ± 595 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)` – pawelswiecki Apr 22 '18 at 12:47
  • There are also C sources for `partial`: https://github.com/python/cpython/blob/3070b71e5eedf62e49b8e7dedab75742a5f67ece/Modules/_functoolsmodule.c – awesoon Apr 22 '18 at 12:50
  • And it seems like the `try` / `except` just after Python's implementation overwrites the Python class with the C-based implementation – awesoon Apr 22 '18 at 12:50
  • @soon Right. With this line added I'm back at `3.67 ms +/-`. Unfortunately I'm not good enough with C to quickly understand the optimization that is going on there. – pawelswiecki Apr 22 '18 at 12:53
  • I think the optimization is simply that it is written in C, not something special it does in the C implementation. Python-level function calls are pretty expensive, doing them from C is faster. – Blckknght Apr 22 '18 at 12:58
  • Well, I am also not an expert in C, but I think the answer you are looking for should be in the [`partial_fastcall`](https://github.com/python/cpython/blob/3070b71e5eedf62e49b8e7dedab75742a5f67ece/Modules/_functoolsmodule.c#L130) function implementation. Looks like it is optimized for functions with a small number of arguments. Though I may be wrong – awesoon Apr 22 '18 at 13:03
  • One difference is that `max1` needs to look up `add` in the global scope for each call while `max2` only does the lookup once. Adding `add=add` as a lambda parameter will make it faster. Alternatively, you can put it in a local variable of `max1`. – interjay Apr 22 '18 at 14:10
  • @interjay Adding `add=add` to `max1`s parameters did improve the speed but not as much: `4.5 ms ± 28.3 µs per loop (mean ± std. dev. of 20 runs, 100 loops each)` (old max1) vs `4.41 ms ± 70.6 µs per loop (mean ± std. dev. of 20 runs, 100 loops each)` (new max1). I increased timeit's repetitions from 7 to 20. – pawelswiecki Apr 22 '18 at 14:48

1 Answer


The following arguments apply only to CPython; other Python implementations could behave completely differently. You did say your question is about CPython, but it's still important to realize that in-depth questions like this almost always depend on implementation details, which can differ between implementations and even between CPython versions (CPython 2.7 could be completely different, but so could CPython 3.5)!

Timings

First of all, I can't reproduce differences of 15% or even 20%. On my computer the difference is around 10%. It's even less when you change the lambda so it doesn't have to look up add in the global scope (as already pointed out in the comments, you can pass the add function as a default argument to the lambda so that the lookup happens in the local scope).

from functools import partial

def add(x, y, z, a):
    return x + y + z + a

def max_lambda_default(lst):
    return max(lst , key=lambda a, add=add: add(10, 20, 30, a))

def max_lambda(lst):
    return max(lst , key=lambda a: add(10, 20, 30, a))

def max_partial(lst):
    return max(lst , key=partial(add, 10, 20, 30))

I actually benchmarked these:

[Figure: benchmark plot of the runtime difference in percent relative to max_partial, by list size]

from simple_benchmark import benchmark
from collections import OrderedDict

arguments = OrderedDict((2**i, list(range(2**i))) for i in range(1, 20))
b = benchmark([max_lambda_default, max_lambda, max_partial], arguments, "list size")

%matplotlib notebook
b.plot_difference_percentage(relative_to=max_partial)

Possible explanations

It's very hard to find the exact reason for the difference. However there are a few possible options, assuming you have a CPython version with compiled _functools module (all desktop versions of CPython that I use have it).

As you already found out the Python version of partial will be significantly slower.
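If you want to check which variant your interpreter uses, a minimal sketch is to compare `functools.partial` with the class exported by the private `_functools` module. This assumes CPython's layout, where `functools.py` tries to import `partial` from `_functools`; `uses_c_partial` is just an illustrative name.

```python
import functools


def uses_c_partial():
    """Return True if functools.partial is the class from the C module _functools."""
    try:
        import _functools  # private CPython module; may be missing elsewhere
    except ImportError:
        return False
    return functools.partial is getattr(_functools, "partial", None)


print(uses_c_partial())
```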

  • partial is implemented in C and can call the function directly, without an intermediate Python layer (see footnote 1). The lambda on the other hand needs to do a Python level call to the "captured" function.

  • partial actually knows how the arguments fit together. So it can create the arguments that are passed to the function more efficiently (it just concatenates the stored argument tuple with the passed-in argument tuple) instead of building a completely new argument tuple.

  • In more recent Python versions several internals were changed in an effort to optimize function calls (the so called FASTCALL optimization). Victor Stinner has a list of related pull requests on his blog in case you want to find out more about it.

    That probably will affect both the lambda and the partial but again because partial is a C function it knows which one to call directly without having to infer it like lambda does.
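The stored arguments mentioned in the list above are visible from Python through the documented `func`, `args` and `keywords` attributes of a partial object. The following small sketch only illustrates what a call is equivalent to; it says nothing about how the C code builds the arguments internally.

```python
from functools import partial


def add(x, y, z, a):
    return x + y + z + a


p = partial(add, 10, 20, 30)

print(p.func is add)  # the wrapped callable is stored as-is
print(p.args)         # the stored positional arguments
print(p.keywords)     # the stored keyword arguments (none here)

# Calling p(5) is equivalent to appending the call-time arguments
# to the stored ones and calling the wrapped function.
call_args = p.args + (5,)
print(p(5) == p.func(*call_args, **p.keywords))
```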

However, it's very important to realize that creating the partial has some overhead. The break-even point is around 10 list elements; for shorter lists the lambda will be faster.
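To look for the break-even point on your own machine without third-party packages, a rough sketch with the standard-library `timeit` could look like this. The helper name `compare`, the chosen sizes and the repeat count are arbitrary assumptions, and the numbers will vary between machines and CPython versions.

```python
from functools import partial
from timeit import timeit


def add(x, y, z, a):
    return x + y + z + a


def max_lambda(lst):
    return max(lst, key=lambda a: add(10, 20, 30, a))


def max_partial(lst):
    # The partial object is created on every call, so its creation cost is included.
    return max(lst, key=partial(add, 10, 20, 30))


def compare(sizes=(1, 10, 100, 1000), number=2000):
    """Time both variants for each list size; returns {size: (t_lambda, t_partial)}."""
    results = {}
    for n in sizes:
        lst = list(range(n))
        t_lambda = timeit(lambda: max_lambda(lst), number=number)
        t_partial = timeit(lambda: max_partial(lst), number=number)
        results[n] = (t_lambda, t_partial)
    return results


for n, (t_lambda, t_partial) in compare().items():
    print(f"n={n:>5}: lambda {t_lambda:.4f}s  partial {t_partial:.4f}s")
```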

Footnotes

1 If you call a function from Python it uses the opcode CALL_FUNCTION, which is actually a wrapper (that's what I meant by Python layer) around the PyObject_Call* (or FASTCALL) functions. But it also includes creating the argument tuple/dictionary. If you call a function from a C function you can avoid this thin wrapper by directly calling the PyObject_Call* functions.

In case you're interested in the opcodes, you can disassemble the function:

import dis
    
dis.dis(max_lambda_default)

 0 LOAD_GLOBAL              0 (max)
 2 LOAD_FAST                0 (lst)
 4 LOAD_GLOBAL              1 (add)
 6 BUILD_TUPLE              1
 8 LOAD_CONST               1 (<code object <lambda>>)
10 LOAD_CONST               2 ('max_lambda_default.<locals>.<lambda>')
12 MAKE_FUNCTION            1 (defaults)
14 LOAD_CONST               3 (('key',))
16 CALL_FUNCTION_KW         2
18 RETURN_VALUE

Disassembly of <code object <lambda>>:
 0 LOAD_FAST                1 (add)      <--- (2)
 2 LOAD_CONST               1 (10)
 4 LOAD_CONST               2 (20)
 6 LOAD_CONST               3 (30)
 8 LOAD_FAST                0 (a)
10 CALL_FUNCTION            4            <--- (1)
12 RETURN_VALUE

As you can see the CALL_FUNCTION op code (1) is actually in there.

As an aside: the LOAD_FAST (2) is responsible for the performance difference between the lambda with the default argument and the lambda without it. With add=add, add is a local variable of the lambda, so LOAD_FAST can fetch it directly from the frame's local slots. Without the default, the compiler emits LOAD_GLOBAL instead, which has to look the name up in the module's global namespace (and, failing that, in the builtins). And that lookup is done every time the lambda is called!
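The difference between the two lambdas can be checked directly with `dis.get_instructions`. This is a small sketch in which `loads_of` is a hypothetical helper; the exact opcode names for local loads differ between CPython versions, so only the presence or absence of LOAD_GLOBAL is worth relying on.

```python
import dis


def add(x, y, z, a):
    return x + y + z + a


with_global = lambda a: add(10, 20, 30, a)
with_default = lambda a, add=add: add(10, 20, 30, a)


def loads_of(func, name):
    """Names of the opcodes in func's bytecode whose argument is the given name."""
    return [ins.opname for ins in dis.get_instructions(func) if ins.argval == name]


# add is resolved as a global here, so a LOAD_GLOBAL shows up.
print(loads_of(with_global, "add"))
# add is a local (default argument) here, so a local-load opcode is used instead.
print(loads_of(with_default, "add"))
```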

MSeifert
  • Isn't `functools.partial` implemented in Python since Python 3.4? https://github.com/python/cpython/blob/master/Lib/functools.py#L234 – Peter Nimroot Apr 22 '18 at 15:22
  • @PeterNimroot That's just a fallback in case you don't have the compiled `_functools` module ([the class is overwritten by an import shortly after](https://github.com/python/cpython/blob/master/Lib/functools.py#L312-L315)). And it's very rare that you don't have that. – MSeifert Apr 22 '18 at 15:24
  • @MSeifert small precision wrt name resolution: `LOAD_NAME` is because `compile` has no idea what the scopes are here, if you `dis.dis(lambda a: add(10, 20, 30, a))` you'll find that the function body uses the specialised `LOAD_GLOBAL` and `LOAD_FAST` rather than the more general `LOAD_NAME`. `LOAD_FAST` can directly index into the frame's local namespace instead of having to go through looking up and resolving the name making it especially fast. With `lambda a, add=add: add(10, 20, 30, a)` the `add` is also `LOAD_FAST`-ed, which is why this form is noticeably faster (& a common optimisation). – Masklinn Mar 23 '21 at 13:58
  • @Masklinn That's kind of what I meant. The `LOAD_NAME` makes it slower - what happens if you don't use it as default. However the answer is 3 years old now ... so probably a few things have changed in the meantime. – MSeifert Mar 23 '21 at 18:23
  • @MSeifert the `LOAD_NAME` would not be present in the lambda though, one would `LOAD_GLOBAL` the `add` while the other would `LOAD_FAST` it. The presence of `LOAD_NAME` is an artefact of dis-ing a string. I don't think this has changed in a while. – Masklinn Mar 23 '21 at 19:38
  • @Masklinn Ah I understand what you're saying now and have updated the answer. – MSeifert Mar 24 '21 at 08:12