Why does enumerate execute slower when not specifying the keyword start?

Question

I noticed the following odd behaviour when timing enumerate with the default start parameter specified:

In [23]: %timeit enumerate([1, 2, 3, 4])
The slowest run took 7.18 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 511 ns per loop

In [24]: %timeit enumerate([1, 2, 3, 4], start=0)
The slowest run took 12.45 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.22 µs per loop

So, approximately a 2x slowdown for the case where start is specified.

The byte code issued for each case doesn't really indicate anything that would contribute to the significant difference in speed. Case in point, after examining the different calls with dis.dis the additional commands issued are:

18 LOAD_CONST               5 ('start')
21 LOAD_CONST               6 (0)

These, along with the CALL_FUNCTION having 1 keyword, are the only differences.

I tried tracing through the calls made in CPythons ceval with gdb and both seem to use do_call in call_function and not some other optimization I could detect.

Now, I understand enumerate just creates an enumerate iterator, so we're dealing with object creation here (right?). I looked in Objects/enumobject.c trying to spot any differences if start was specified. The only thing that (I believe) differs is when start != NULL in which the following happens:

if (start != NULL) {
    start = PyNumber_Index(start);
    if (start == NULL) {
        Py_DECREF(en);
        return NULL;
    }
    assert(PyInt_Check(start) || PyLong_Check(start));
    en->en_index = PyInt_AsSsize_t(start);
    if (en->en_index == -1 && PyErr_Occurred()) {
        PyErr_Clear();
        en->en_index = PY_SSIZE_T_MAX;
        en->en_longindex = start;
    } else {
        en->en_longindex = NULL;
        Py_DECREF(start);
    }

Which doesn't look like something which would introduce a 2x slowdown. (I think, not sure.)

The previous code segments have been executed on Python 3.5, similar results are present in 2.x too, though.

This is where I'm stuck and can't figure out where to look. This might just be a case of overhead from additional calls in the second case accumulating, but again, I'm not really sure. Does anybody know what might be the reason behind this?

Have you measured differences for bigger input sizes? It's much more likely that it's a const factor difference, rather than 2x speed regression. — Łukasz Rogalski, Jan 08 '16 at 12:37
@Rogalski: We're only creating the iterator, not iterating over it. The speed doesn't depend on input size at all. — user2357112, Jan 08 '16 at 12:41
Apologies for the title, it is a bit misleading. As for the input size, what @user2357112 said. — Ariadni, Jan 08 '16 at 12:45
@user2357112 I was sure enumerate is not lazy in Python2 (like map and so on), my bad. Obviously my comment is redundant - constant time operation will indeed be observed as `n` times worse. — Łukasz Rogalski, Jan 08 '16 at 12:49

Mazdak · Answer 1 · 2016-01-08T13:09:03.190

6

One reason might be because of calling the PyNumber_Index while you specify a start in following part :

if (start != NULL) {
    start = PyNumber_Index(start);

And If you take a look at PyNumber_Index function in abstract.c module you'll see the following comment at the top level of function:

/* Return a Python int from the object item.
   Raise TypeError if the result is not an int
   or if the object cannot be interpreted as an index.
*/

So this function has to check if the object cannot be interpreted as an index and will returns the relative errors. And if you look at the source carefully you'll see the all of this checking and referencing, specially in following part which has to do a nested structure dereference in order to checking the index type:

result = item->ob_type->tp_as_number->nb_index(item);
if (result &&
     !PyInt_Check(result) && !PyLong_Check(result)) {
                         ...

Would takes much time to check and return a desire result.

But as @ user2357112 mentioned, another and most important reason is because of the python keyword argument matching.

If you time-it the function without keyword argument you'll see the difference time will decrease approximately ~2X time:

~$ python -m timeit "enumerate([1, 2, 3, 4])"
1000000 loops, best of 3: 0.251 usec per loop
~$ python -m timeit "enumerate([1, 2, 3, 4],start=0)"
1000000 loops, best of 3: 0.431 usec per loop
~$ python -m timeit "enumerate([1, 2, 3, 4],0)"
1000000 loops, best of 3: 0.275 usec per loop

The difference with positional argument is:

>>> 0.251 - 0.275
-0.024

Which seems that is because of PyNumber_Index.

edited Jan 08 '16 at 13:09

answered Jan 08 '16 at 12:27

Mazdak

105,000
18
159
188

6

Nah, try specifying `start` as a positional argument, and the speed difference disappears. I'd say the keyword argument dict is the culprit. – user2357112 Jan 08 '16 at 12:37
@user2357112 Let me give it a try. – Mazdak Jan 08 '16 at 12:42
Not really sure about this. Specifying a simple function with `def foo(par=0): pass` and timing it's different calls yields: `foo() -> 164ns` and `foo(par=0) -> 183ns`. Any thoughts? – Ariadni Jan 08 '16 at 13:02
@Ariadni I think its because of built in functions are implemented in C and when you give it a keyword argument it first of all python tries to pars the args at interpreter level then pass it to C level – Mazdak Jan 08 '16 at 13:07
@Ariadni: when you use `foo`, the timings are "shadowed" by your user function, while `enumerate` is a builtin. Look at the absolute delta, you should have similar numberes – Andrea Corbellini Jan 08 '16 at 13:09
@Ariadni: `PyArg_ParseTupleAndKeywords` is unusually slow when it needs to look for keywords partially because, in C, the keywords are stored as C-style strings. So every single call requires it to convert the internal keywords that aren't passed positionally to full Python `str` objects (scanning the C-style string constant to get length, allocating a new `str`, filling it in, hashing it, looking it up in the `kwds` `dict`). In addition, the `CALL_FUNCTION` op code has a special fast path for C built-ins called without keyword arguments that's messing you up. – ShadowRanger Jan 12 '16 at 19:05
1

@Ariadni: For reference, [the code that `CALL_FUNCTION` is delegating to](https://hg.python.org/cpython/file/2.7/Python/ceval.c#l4309). You'll notice the first `if` check is checking for `PyCFunction`s (that is, CPython functions implemented in C) where `nk == 0` (`nk` is number of keywords). If no keywords are passed to multiarg function, it uses the fast path through `PyCFunction_Call`, and that path doesn't even construct an empty `dict` for the `kwds` (passes `NULL`); by not passing kwds, you save `dict` construction, C-style string to `str` construction/destruction, `dict` lookup, etc. – ShadowRanger Jan 12 '16 at 19:10

Dimitris Fasarakis Hilliard · Answer 2 · 2016-01-12T18:47:17.453

It probably is just is a combination of factors contributing to the overall slowdown.

Keyword args:

When Python sees the CALL_FUNCTION argument it will call call_function as you already pointed out. After going through some if clauses the call issued is x = do_call(func, pp_stack, na, nk);. Notice nk here which holds the total count of keyword arguments (so in the case of enumerate -> kw=1).

In do_call you'll see the following if clause:

if (nk > 0) {
    kwdict = update_keyword_args(NULL, nk, pp_stack, func);
    if (kwdict == NULL)
        goto call_fail;
}

If the number of keyword args is not zero (nk > 0), call update_keyword_args. Now, update_keyword_args does what you would expect, if orig_kwdict is NULL (which it is, look at the call to update_keyword_args) create a new dictionary:

if (orig_kwdict == NULL)
    kwdict = PyDict_New();

and then populate the dictionary with all values present in the value stack:

while (--nk >= 0) {
// copy from stack

These probably contribute significant to the overall delay.

Creating the `enum` object:

You're right about enum_new, if called with enumerate([1, 2, 3, 4], start=0) the variable start inside enum_new will have a value and therefore be != NULL. As a result the if clause will evaluate to True and the code inside it will execute, adding time to the call.

What is performed inside the if clause is not really heavy work, but it does contribute to the overall time required.

Additionally:

you also have the two additional byte code commands to consider, they might just be two but they add to the overall time taken due to the fact that we're timing really quick things (in the range of ns).
Again, insignificant from an overall standpoint, but, parsing a call with kws requires as before, a wee bit more time.

Finally:

I might be missing some stuff but overall these are some of the factors that, in conjunction, create overhead when creating a new enumerate object with start specified.

Why does enumerate execute slower when not specifying the keyword start?

2 Answers2

Keyword args:

Creating the enum object:

Creating the `enum` object: