Performance of map vs starmap?

Question

I was trying to make a pure-python (without external dependencies) element-wise comparison of two sequences. My first solution was:

list(map(operator.eq, seq1, seq2))

Then I found starmap function from itertools, which seemed pretty similar to me. But it turned out to be 37% faster on my computer in worst case. As it was not obvious to me, I measured the time necessary to retrieve 1 element from a generator (don't know if this way is correct):

from operator import eq
from itertools import starmap

seq1 = [1,2,3]*10000
seq2 = [1,2,3]*10000
seq2[-1] = 5

gen1 = map(eq, seq1, seq2))
gen2 = starmap(eq, zip(seq1, seq2))

%timeit -n1000 -r10 next(gen1)
%timeit -n1000 -r10 next(gen2)

271 ns ± 1.26 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
208 ns ± 1.72 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)

In retrieving elements the second solution is 24% more performant. After that, they both produce the same results for list. But from somewhere we gain extra 13% in time:

%timeit list(map(eq, seq1, seq2))
%timeit list(starmap(eq, zip(seq1, seq2)))

5.24 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.34 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I don't know how to dig deeper in profiling of such nested code? So my question is why the first generator so faster in retrieving and from where we gain extra 13% in list function?

EDIT: My first intention was to perform element-wise comparison instead of all, so the all function was replaced with list. This replacement does not affect the timing ratio.

CPython 3.6.2 on Windows 10 (64bit)

@Błotosmętek thank you for correction! My first intention was element-wise comparision instead of `all`, which was not obvious from my question:) Really if you substitute `list` instead of `all` the order of times will be the same. — godaygo, Sep 12 '17 at 09:18

MSeifert · Accepted Answer · 2017-09-12T11:58:51.877

There are several factors that contribute (in conjunction) to the observed performance difference:

zip re-uses the returned tuple if it has a reference count of 1 when the next __next__ call is made.
map builds a new tuple that is passed to the "mapped function" every time a __next__ call is made. Actually it probably won't create a new tuple from scratch because Python maintains a storage for unused tuples. But in that case map has to find an unused tuple of the right size.
starmap checks if the next item in the iterable is of type tuple and if so it just passes it on.
Calling a C function from within C code with PyObject_Call won't create a new tuple that is passed to the callee.

So starmap with zip will only use one tuple over and over again that is passed to operator.eq thus reducing the function call overhead immensely. map on the other hand will create a new tuple (or fill a C array from CPython 3.6 on) every time operator.eq is called. So what is actually the speed difference is just the tuple creation overhead.

Instead of linking to the source code I'll provide some Cython code that can be used to verify this:

In [1]: %load_ext cython

In [2]: %%cython
   ...:
   ...: from cpython.ref cimport Py_DECREF
   ...:
   ...: cpdef func(zipper):
   ...:     a = next(zipper)
   ...:     print('a', a)
   ...:     Py_DECREF(a)
   ...:     b = next(zipper)
   ...:     print('a', a)

In [3]: func(zip([1, 2], [1, 2]))
a (1, 1)
a (2, 2)

Yes, tuples aren't really immutable, a simple Py_DECREF was sufficient to "trick" zip into believing noone else holds a reference to the returned tuple!

As for the "tuple-pass-thru":

In [4]: %%cython
   ...:
   ...: def func_inner(*args):
   ...:     print(id(args))
   ...:
   ...: def func(*args):
   ...:     print(id(args))
   ...:     func_inner(*args)

In [5]: func(1, 2)
1404350461320
1404350461320

So the tuple is passed right through (just because these are defined as C functions!) This doesn't happen for pure Python functions:

In [6]: def func_inner(*args):
   ...:     print(id(args))
   ...:
   ...: def func(*args):
   ...:     print(id(args))
   ...:     func_inner(*args)
   ...:

In [7]: func(1, 2)
1404350436488
1404352833800

Note that it also doesn't happen if the called function isn't a C function even if called from a C function:

In [8]: %%cython
   ...: 
   ...: def func_inner_c(*args):
   ...:     print(id(args))
   ...: 
   ...: def func(inner, *args):
   ...:     print(id(args))
   ...:     inner(*args)
   ...:

In [9]: def func_inner_py(*args):
    ...:     print(id(args))
    ...:
    ...:

In [10]: func(func_inner_py, 1, 2)
1404350471944
1404353010184

In [11]: func(func_inner_c, 1, 2)
1404344354824
1404344354824

So there are a lot of "coincidences" leading up to the point that starmap with zip is faster than calling map with multiple arguments when the called function is also a C function...

Ashwini Chaudhary · Answer 2 · 2017-09-12T11:10:20.677

2

One difference I can notice is the how map retrieves items from the iterables. Both map and zip create a tuple of iterators from each iterable passed. Now zip maintains a result tuple internally that is populated every time next is called and on the other hand, map creates a new array^* with each next call and deallocates it.

*_{As pointed out by MSeifert till 3.5.4 map_next used to allocate a new Python tuple everytime. This changed in 3.6 and till 5 iterables C stack is used and for anything larger than that heap is used. Related PRs: Issue #27809: map_next() uses fast call and Add _PY_FASTCALL_SMALL_STACK constant | Issue: https://bugs.python.org/issue27809}

edited Sep 12 '17 at 11:10

answered Sep 12 '17 at 10:01

Ashwini Chaudhary

244,495
58
464
504

That would assume that this is 3.6, note that the code in [3.5.4](https://github.com/python/cpython/blob/v3.5.4/Python/bltinmodule.c#L1168-L1192) looks different. :) – MSeifert Sep 12 '17 at 10:08
@MSeifert I wonder how much slow/fast 3.5.4 implementation was compared to 3.6.2. – Ashwini Chaudhary Sep 12 '17 at 10:55
(should be slow till 5 iterables because of heap vs C-Stack access) – Ashwini Chaudhary Sep 12 '17 at 11:01
Yes, you're correct. My "research" lead me to the same conclusion. – MSeifert Sep 12 '17 at 11:57

Performance of map vs starmap?

2 Answers2

Linked