Often, when gluing Python and C code together, one needs to convert a Python list to contiguous memory, e.g. an array.array. It's also not unusual for this conversion step to become the bottleneck, so I find myself doing silly things with Cython because it is faster than the built-in Python solutions.
For example, to convert a Python list lst to contiguous int32 memory, I'm aware of two possibilities:

a = array.array('i', lst)

and

a = array.array('i')
a.fromlist(lst)
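Both variants produce identical results; a quick, self-contained sanity check (plain CPython, no Cython needed):

```python
import array

lst = [1, 2, 3, 4]

# Variant 1: construct directly from the list
a = array.array('i', lst)

# Variant 2: create an empty array, then bulk-copy the list
b = array.array('i')
b.fromlist(lst)

# On most platforms the 'i' typecode is a 4-byte signed C int,
# i.e. the int32 layout we are after.
assert a == b
print(a.tolist())  # [1, 2, 3, 4]
```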
However, both are slower than the following Cython version:
%%cython
import array
from cpython cimport array

def array_from_list_iter(lst):
    cdef Py_ssize_t n = len(lst)
    cdef array.array res = array.array('i')
    cdef int cnt = 0
    array.resize(res, n)  # preallocate memory
    for i in lst:
        res.data.as_ints[cnt] = i
        cnt += 1
    return res
My timings (Linux, Python 3.6; the results are very similar on Windows and/or Python 2.7) show that the Cython solution is about 6 times faster:
Size    new_array  from_list  cython_iter  factor
1       284ns      347ns      176ns        1.6
10      599ns      621ns      209ns        2.9
10**2   3.7µs      3.5µs      578ns        6.1
10**3   38.5µs     32µs       4.3µs        7.4
10**4   343µs      316µs      40.4µs       7.8
10**5   3.5ms      3.4ms      481µs        7.1
10**6   34.1ms     31.5ms     5.0ms        6.3
10**7   353ms      316ms      53.3ms       5.9
With my limited understanding of CPython, I would say that the from_list solution uses this built-in function:
static PyObject *
array_array_fromlist(arrayobject *self, PyObject *list)
{
    Py_ssize_t n;

    if (!PyList_Check(list)) {
        PyErr_SetString(PyExc_TypeError, "arg must be list");
        return NULL;
    }
    n = PyList_Size(list);
    if (n > 0) {
        Py_ssize_t i, old_size;
        old_size = Py_SIZE(self);
        if (array_resize(self, old_size + n) == -1)
            return NULL;
        for (i = 0; i < n; i++) {
            PyObject *v = PyList_GetItem(list, i);
            if ((*self->ob_descr->setitem)(self,
                    Py_SIZE(self) - n + i, v) != 0) {
                array_resize(self, old_size);
                return NULL;
            }
        }
    }
    Py_RETURN_NONE;
}
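In pure Python terms, the C loop above does roughly the following (fromlist_sketch is an illustrative name of mine, not a CPython function): every element is routed through the generic per-typecode setitem handler, which re-checks the value on each iteration, and the array is rolled back to its old size on failure:

```python
import array

def fromlist_sketch(arr, lst):
    """Rough Python analogue of array_array_fromlist (illustrative only)."""
    if not isinstance(lst, list):
        raise TypeError("arg must be list")
    n = len(lst)
    if n > 0:
        old_size = len(arr)
        arr.extend([0] * n)          # stands in for array_resize
        try:
            for i, v in enumerate(lst):
                # arr[...] = v goes through the generic setitem machinery
                # (type/overflow check on every single element)
                arr[old_size + i] = v
        except Exception:
            del arr[old_size:]       # roll back, like array_resize(self, old_size)
            raise

a = array.array('i')
fromlist_sketch(a, [1, 2, 3])
print(a.tolist())  # [1, 2, 3]
```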
a = array.array('i', lst) grows the array dynamically and needs to reallocate, which could explain some of the slow-down (though, as the measurements show, not by much!), but array_array_fromlist preallocates the needed memory up front - it is basically exactly the same algorithm as the Cython code.
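The dynamic growth can be observed by watching the buffer address via buffer_info() while appending (a rough probe, subject to allocator behavior): thanks to over-allocation, the number of distinct buffer addresses stays far below the number of appends:

```python
import array

a = array.array('i')
addresses = set()
for v in range(10_000):
    a.append(v)
    addresses.add(a.buffer_info()[0])  # current start address of the buffer

# Over-allocation means the buffer is reallocated only occasionally,
# not on every append (the set size is an upper-bound-ish proxy, since
# realloc may also reuse addresses).
assert len(addresses) < len(a)
print(len(addresses), "distinct addresses for", len(a), "appends")
```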
So the question: why is the built-in solution about 6 times slower than the Cython code? What am I missing?
Here is the code for measuring the timings:

import array
import numpy as np

for n in [1, 10, 10**2, 10**3, 10**4, 10**5, 10**6, 10**7]:
    print("N=", n)
    lst = list(range(n))
    print("python:")
    %timeit array.array('i', lst)
    print("python, from list:")
    %timeit a = array.array('i'); a.fromlist(lst)
    print("numpy:")
    %timeit np.array(lst, dtype=np.int32)
    print("cython_iter:")
    %timeit array_from_list_iter(lst)
The numpy solution is about a factor of 2 slower than the pure Python versions.