Function call overhead - why do builtin Python builtins appear to be faster than my builtins?

Question

I've been interested in function call overheads, so I wrote a minimal C extension exporting two functions, nop and starnop, that do more or less nothing. They just pass through their input (the two relevant functions are right at the top; the rest is just tedious boilerplate code):

amanmodule.c:

#include <Python.h>

static PyObject* aman_nop(PyObject *self, PyObject *args)
{
  PyObject *obj;

  if (!PyArg_UnpackTuple(args, "arg", 1, 1, &obj))
    return NULL;
  Py_INCREF(obj);
  return obj;
}

static PyObject* aman_starnop(PyObject *self, PyObject *args)
{
  Py_INCREF(args);
  return args;
}

static PyMethodDef AmanMethods[] = {
  {"nop",  (PyCFunction)aman_nop, METH_VARARGS,
   PyDoc_STR("nop(arg) -> arg\n\nReturn arg unchanged.")},
  {"starnop", (PyCFunction)aman_starnop, METH_VARARGS,
   PyDoc_STR("starnop(*args) -> args\n\nReturn tuple of args unchanged")},
  {NULL, NULL}
};

static struct PyModuleDef amanmodule = {
    PyModuleDef_HEAD_INIT,
    "aman",
    "aman - a module about nothing.\n\n"
    "Provides functions 'nop' and 'starnop' which do nothing:\n"
    "nop(arg) -> arg; starnop(*args) -> args\n",
    -1,
    AmanMethods
};

PyMODINIT_FUNC
PyInit_aman(void)
{
    return PyModule_Create(&amanmodule);
}

setup.py:

from setuptools import setup, Extension

setup(name='aman', version='1.0',
      ext_modules=[Extension('aman', ['amanmodule.c'])],
      author='n.n.',
      description="""aman - a module about nothing

      Provides functions 'nop' and 'starnop' which do nothing:
      nop(arg) -> arg; starnop(*args) -> args
      """,
      license='public domain',
      keywords='nop pass-through identity')

Next, I time them against pure Python implementations and a couple of builtins that also do next to nothing:

import numpy as np
from aman import nop, starnop
from timeit import timeit

def mnsd(x): return '{:8.6f} \u00b1 {:8.6f} \u00b5s'.format(np.mean(x), np.std(x))

def pnp(x): x

globals={}
for globals['nop'] in (int, bool, (0).__add__, hash, starnop, nop, pnp, lambda x: x):
    print('{:60s}'.format(repr(globals['nop'])),
          mnsd([timeit('nop(1)', globals=globals) for i in range(10)]),
          '  ',
          mnsd([timeit('nop(True)',globals=globals) for i in range(10)]))

First question: am I making any obvious methodological mistakes?

Results for 10 blocks of 1,000,000 calls each:

<class 'int'>                                                0.099754 ± 0.003917 µs    0.103933 ± 0.000585 µs
<class 'bool'>                                               0.097711 ± 0.000661 µs    0.094412 ± 0.000612 µs
<method-wrapper '__add__' of int object at 0x8c7000>         0.065146 ± 0.000728 µs    0.064976 ± 0.000605 µs
<built-in function hash>                                     0.039546 ± 0.000671 µs    0.039566 ± 0.000452 µs
<built-in function starnop>                                  0.056490 ± 0.000873 µs    0.056234 ± 0.000181 µs
<built-in function nop>                                      0.060094 ± 0.000799 µs    0.059959 ± 0.000170 µs
<function pnp at 0x7fa31c0512f0>                             0.090452 ± 0.001077 µs    0.098479 ± 0.003314 µs
<function <lambda> at 0x7fa31c051378>                        0.086387 ± 0.000817 µs    0.086536 ± 0.000714 µs

Now my actual question: even though my nops are written in C and do nothing (starnop doesn't even parse its arguments) the builtin hash function is consistently faster. I know that ints are their own hash values in Python, so hash also is a nop here but it isn't nopper than my nops, so why the speed difference?
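The claim that hash is effectively a nop for these inputs can be checked with a short sketch. It assumes CPython, where small non-negative ints hash to themselves; the variable names are only for illustration.

```python
import sys

# Small non-negative ints are their own hash; bool is a subclass of int,
# so True hashes like 1.
hash_of_one = hash(1)
hash_of_true = hash(True)

# -1 is reserved as an error marker at the C level, so CPython maps it to -2.
hash_of_minus_one = hash(-1)

# Larger ints are reduced modulo sys.hash_info.modulus, so the identity
# only holds below that bound.
hash_of_modulus = hash(sys.hash_info.modulus)

print(hash_of_one, hash_of_true, hash_of_minus_one, hash_of_modulus)
# prints: 1 1 -2 0
```

For the arguments timed here (1 and True), hash therefore returns its input unchanged, so its work is comparable to that of nop.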

Update: I completely forgot to mention that I'm on a pretty standard x86_64 machine running Linux, with gcc 4.8.5. I install the extension using python3 setup.py install --user.

How are you compiling your C code? You haven't told us this vital piece of information. Also, how is Python being compiled? Are there any optimisations enabled in Python which aren't in your code? — autistic, Dec 10 '17 at 07:07
@Sebivor just calling the setup script: `python3 setup.py install --user`. I always assumed that uses the same compiler settings Python itself was compiled with, unless you explicitly specify otherwise. I'll update the question. — Paul Panzer, Dec 10 '17 at 07:49
... and the answer to the other questions I asked? Do you have those, too? If not, you're not doing enough research to ask this question... — autistic, Dec 10 '17 at 08:30
You could save yourself a lot of typing (of this question) by doing research before-hand, you know? A sensible place to begin is reading your compilers manual pages, which will tell you all about plenty of subtle optimisations that happen under the hood, among other helpful stuff which you'll likely later ask about. — autistic, Dec 10 '17 at 08:32
That is not how you're compiling your C code. Please show us how you're compiling your C code... — autistic, Dec 10 '17 at 08:34
@Sebivor relax, as I was trying to explain and as the soon to be accepted answer confirms compiler issues were a theoretically possible but not very likely explanation. The python build system and the setuptools are highly sophisticated, in a simple case like this one setuptools does the entire build for you, using the same compiler settings that were used to build python - why on earth should it do anything different. The line I showed is literally the only thing I had to do. — Paul Panzer, Dec 10 '17 at 09:11
Relaxing is good to practice while reading manuals. The last thing you'd want is to read a manual with an angry, abusive tone in your head; you might erroneously think the person who wrote the manual was verbose because they were getting irritated. — autistic, Dec 10 '17 at 10:37

score 4 · Accepted Answer · answered Dec 10 '17 at 07:52

Much (most?) of the overhead in Python function calls is the creation of the args tuple. The argument parsing also adds some overhead.

Functions defined using the METH_VARARGS calling convention require the creation of a tuple to store all the arguments. If you just need a single argument, you can use the METH_O calling convention instead. With METH_O, no tuple is created; the single argument is passed directly. I've added a nop1 to your example which uses METH_O.

It's possible to define functions that do not take any arguments using METH_NOARGS. See nop2 for the least possible overhead.

When using METH_VARARGS, it is possible to decrease the overhead slightly by reading the args tuple directly instead of calling PyArg_UnpackTuple or the related PyArg_ functions. See nop3.

The builtin hash() function uses the METH_O calling convention, which is why it is faster than both nop and starnop.
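As a loose pure-Python analogy (an illustration only, not how the C calling conventions are implemented), a function declared with *args makes the interpreter pack its positional arguments into a tuple, much like METH_VARARGS, while a function with one named parameter receives the object directly, much like METH_O. The names via_tuple and direct are hypothetical.

```python
from timeit import timeit

def via_tuple(*args):
    # Python packs the positional arguments into a fresh tuple for *args,
    # loosely like METH_VARARGS; we then unpack it by hand, as nop3 does.
    if len(args) != 1:
        raise TypeError("via_tuple requires 1 argument")
    return args[0]

def direct(arg):
    # The single argument arrives as-is, loosely like METH_O.
    return arg

if __name__ == "__main__":
    calls = 200_000
    for func in (via_tuple, direct):
        seconds = timeit("f(1)", globals={"f": func}, number=calls)
        print(f"{func.__name__:10s} {seconds / calls * 1e6:8.6f} µs per call")
```

Both functions return their argument unchanged. Any timing difference between them is only suggestive, since the cost of running Python bytecode dominates here in a way it does not for the C functions.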

Modified amanmodule.c

#include <Python.h>

static PyObject* aman_nop(PyObject *self, PyObject *args)
{
  PyObject *obj;

  if (!PyArg_UnpackTuple(args, "arg", 1, 1, &obj))
    return NULL;
  Py_INCREF(obj);
  return obj;
}

static PyObject* aman_nop1(PyObject *self, PyObject *other)
{
  Py_INCREF(other);
  return other;
}

static PyObject* aman_nop2(PyObject *self, PyObject *Py_UNUSED(ignored))
{
  Py_RETURN_NONE;
}

static PyObject* aman_nop3(PyObject *self, PyObject *args)
{
  PyObject *obj;

  if (PyTuple_GET_SIZE(args) == 1) {
    obj = PyTuple_GET_ITEM(args, 0);
    Py_INCREF(obj);
    return obj;
  }
  else {
    PyErr_SetString(PyExc_TypeError, "nop3 requires 1 argument");
    return NULL;
  }
}

static PyObject* aman_starnop(PyObject *self, PyObject *args)
{
  Py_INCREF(args);
  return args;
}

static PyMethodDef AmanMethods[] = {
  {"nop",  (PyCFunction)aman_nop, METH_VARARGS,
   PyDoc_STR("nop(arg) -> arg\n\nReturn arg unchanged.")},
  {"nop1",  (PyCFunction)aman_nop1, METH_O,
   PyDoc_STR("nop1(arg) -> arg\n\nReturn arg unchanged.")},
  {"nop2",  (PyCFunction)aman_nop2, METH_NOARGS,
   PyDoc_STR("nop2() -> None\n\nTake no arguments and return None.")},
  {"nop3",  (PyCFunction)aman_nop3, METH_VARARGS,
   PyDoc_STR("nop3(arg) -> arg\n\nReturn arg unchanged.")},
  {"starnop", (PyCFunction)aman_starnop, METH_VARARGS,
   PyDoc_STR("starnop(*args) -> args\n\nReturn tuple of args unchanged")},
  {NULL, NULL}
};

static struct PyModuleDef amanmodule = {
    PyModuleDef_HEAD_INIT,
    "aman",
    "aman - a module about nothing.\n\n"
    "Provides functions 'nop' and 'starnop' which do nothing:\n"
    "nop(arg) -> arg; starnop(*args) -> args\n",
    -1,
    AmanMethods
};

PyMODINIT_FUNC
PyInit_aman(void)
{
    return PyModule_Create(&amanmodule);
}

Modified test.py

import numpy as np
from aman import nop, nop1, nop2, nop3, starnop
from timeit import timeit

def mnsd(x): return '{:8.6f} \u00b1 {:8.6f} \u00b5s'.format(np.mean(x), np.std(x))

def pnp(x): x

globals={}
for globals['nop'] in (int, bool, (0).__add__, hash, starnop, nop, nop1, nop3, pnp, lambda x: x):
    print('{:60s}'.format(repr(globals['nop'])),
          mnsd([timeit('nop(1)', globals=globals) for i in range(10)]),
          '  ',
          mnsd([timeit('nop(True)',globals=globals) for i in range(10)]))

# To test with no arguments
for globals['nop'] in (nop2,):
    print('{:60s}'.format(repr(globals['nop'])),
          mnsd([timeit('nop()', globals=globals) for i in range(10)]),
          '  ',
          mnsd([timeit('nop()',globals=globals) for i in range(10)]))
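A variant of this harness that needs neither NumPy nor the extension might look like the sketch below. The helper per_call_us is a hypothetical name, and the functions timed are stand-ins from the standard library. It also reports the minimum, which the timeit documentation suggests is usually more informative than the mean and standard deviation, and it makes the unit conversion explicit: timeit returns total seconds for `number` calls, which for 1,000,000 calls is numerically equal to microseconds per call.

```python
from statistics import mean, pstdev
from timeit import repeat

def per_call_us(stmt, func, number=1_000_000, rounds=10):
    """Return (min, mean, std) of the per-call time in microseconds."""
    # Each entry is the total time in seconds for `number` executions.
    totals = repeat(stmt, globals={"nop": func}, number=number, repeat=rounds)
    per_call = [total / number * 1e6 for total in totals]
    # pstdev is the population standard deviation, matching np.std's default.
    return min(per_call), mean(per_call), pstdev(per_call)

if __name__ == "__main__":
    for func in (hash, int, lambda x: x):
        lo, avg, sd = per_call_us("nop(1)", func, number=100_000, rounds=5)
        print(f"{func!r:45} min {lo:8.6f}  mean {avg:8.6f} ± {sd:8.6f} µs")
```

No output is shown for this sketch because the numbers depend on the machine and the Python build. The results that follow are from test.py.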

Results

$ python3 test.py  
<class 'int'>                                                0.080414 ± 0.004360 µs    0.086166 ± 0.003216 µs
<class 'bool'>                                               0.080501 ± 0.008929 µs    0.075601 ± 0.000598 µs
<method-wrapper '__add__' of int object at 0xa6dca0>         0.045652 ± 0.004229 µs    0.044146 ± 0.000114 µs
<built-in function hash>                                     0.035122 ± 0.003317 µs    0.033419 ± 0.000136 µs
<built-in function starnop>                                  0.044056 ± 0.001300 µs    0.044280 ± 0.001629 µs
<built-in function nop>                                      0.047297 ± 0.000777 µs    0.049536 ± 0.007577 µs
<built-in function nop1>                                     0.030402 ± 0.001423 µs    0.031249 ± 0.002352 µs
<built-in function nop3>                                     0.044673 ± 0.004041 µs    0.042936 ± 0.000177 µs
<function pnp at 0x7f946342d840>                             0.071846 ± 0.005377 µs    0.071085 ± 0.003314 µs
<function <lambda> at 0x7f946342d8c8>                        0.066621 ± 0.001499 µs    0.067163 ± 0.002962 µs
<built-in function nop2>                                     0.027736 ± 0.001487 µs    0.027035 ± 0.000397 µs

Wow, thanks a lot! I'll wait a little more but I somehow doubt a better answer will come up. — Paul Panzer, Dec 10 '17 at 08:01
Thanks again, a complete informed and enjoyable answer. And the speed gained for my `nop` is also quite nice. — Paul Panzer, Dec 10 '17 at 14:14
