Often, when gluing Python and C code together, one needs to convert a Python list to contiguous memory, e.g. an array.array. It's also not unusual for this conversion step to become the bottleneck, so I find myself doing silly things with Cython because it is faster than the built-in Python solutions.
For example, to convert a Python list lst to contiguous int32 memory, I'm aware of two possibilities:

a = array.array('i', lst)

and

a = array.array('i')
a.fromlist(lst)
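Both variants produce identical results; a quick, self-contained sanity check (plain CPython, no Cython needed):

```python
import array

lst = [1, 2, 3, 4]

# Variant 1: construct directly from the list
a = array.array('i', lst)

# Variant 2: create an empty array, then bulk-copy the list
b = array.array('i')
b.fromlist(lst)

# On most platforms the 'i' typecode is a 4-byte signed C int,
# i.e. the int32 layout we are after.
assert a == b
print(a.tolist())  # [1, 2, 3, 4]
```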
However, both are slower than the following Cython version:
%%cython
import array
from cpython cimport array

def array_from_list_iter(lst):
    cdef Py_ssize_t n = len(lst)
    cdef array.array res = array.array('i')
    cdef int cnt = 0
    array.resize(res, n)  # preallocate memory
    for i in lst:
        res.data.as_ints[cnt] = i
        cnt += 1
    return res
My timings (Linux, Python 3.6; the results are very similar on Windows and/or Python 2.7) show that the Cython solution is about 6 times faster:
Size    new_array  from_list  cython_iter  factor
1       284ns      347ns      176ns        1.6
10      599ns      621ns      209ns        2.9
10**2   3.7µs      3.5µs      578ns        6.1
10**3   38.5µs     32µs       4.3µs        7.4
10**4   343µs      316µs      40.4µs       7.8
10**5   3.5ms      3.4ms      481µs        7.1
10**6   34.1ms     31.5ms     5.0ms        6.3
10**7   353ms      316ms      53.3ms       5.9
With my limited understanding of CPython, I would say that the from_list solution uses this built-in function:
static PyObject *
array_array_fromlist(arrayobject *self, PyObject *list)
{
    Py_ssize_t n;

    if (!PyList_Check(list)) {
        PyErr_SetString(PyExc_TypeError, "arg must be list");
        return NULL;
    }
    n = PyList_Size(list);
    if (n > 0) {
        Py_ssize_t i, old_size;
        old_size = Py_SIZE(self);
        if (array_resize(self, old_size + n) == -1)
            return NULL;
        for (i = 0; i < n; i++) {
            PyObject *v = PyList_GetItem(list, i);
            if ((*self->ob_descr->setitem)(self,
                    Py_SIZE(self) - n + i, v) != 0) {
                array_resize(self, old_size);
                return NULL;
            }
        }
    }
    Py_RETURN_NONE;
}
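In pure Python terms, the C loop above does roughly the following (fromlist_sketch is an illustrative name of mine, not a CPython function): every element is routed through the generic per-typecode setitem handler, which re-checks the value on each iteration, and the array is rolled back to its old size on failure:

```python
import array

def fromlist_sketch(arr, lst):
    """Rough Python analogue of array_array_fromlist (illustrative only)."""
    if not isinstance(lst, list):
        raise TypeError("arg must be list")
    n = len(lst)
    if n > 0:
        old_size = len(arr)
        arr.extend([0] * n)          # stands in for array_resize
        try:
            for i, v in enumerate(lst):
                # arr[...] = v goes through the generic setitem machinery
                # (type/overflow check on every single element)
                arr[old_size + i] = v
        except Exception:
            del arr[old_size:]       # roll back, like array_resize(self, old_size)
            raise

a = array.array('i')
fromlist_sketch(a, [1, 2, 3])
print(a.tolist())  # [1, 2, 3]
```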
a = array.array('i', lst) grows the array dynamically and needs to reallocate, which could explain some of the slow-down (though, as the measurements show, not by much!), but array_array_fromlist preallocates the needed memory up front - it is basically exactly the same algorithm as the Cython code.
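The dynamic growth can be observed by watching the buffer address via buffer_info() while appending (a rough probe, subject to allocator behavior): thanks to over-allocation, the number of distinct buffer addresses stays far below the number of appends:

```python
import array

a = array.array('i')
addresses = set()
for v in range(10_000):
    a.append(v)
    addresses.add(a.buffer_info()[0])  # current start address of the buffer

# Over-allocation means the buffer is reallocated only occasionally,
# not on every append (the set size is an upper-bound-ish proxy, since
# realloc may also reuse addresses).
assert len(addresses) < len(a)
print(len(addresses), "distinct addresses for", len(a), "appends")
```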
So the question: why is the built-in solution about 6 times slower than the Cython code? What am I missing?
Here is the code for measuring the timings:

import array
import numpy as np

for n in [1, 10, 10**2, 10**3, 10**4, 10**5, 10**6, 10**7]:
    print("N=", n)
    lst = list(range(n))
    print("python:")
    %timeit array.array('i', lst)
    print("python, from list:")
    %timeit a = array.array('i'); a.fromlist(lst)
    print("numpy:")
    %timeit np.array(lst, dtype=np.int32)
    print("cython_iter:")
    %timeit array_from_list_iter(lst)
The numpy solution is about a factor of 2 slower than the pure Python versions.