How to make Cython much faster than Python (without Numpy) for adding two arrays together?

Question

I want to use Cython to decrease the time it takes to add two arrays together (element-wise) without using NumPy arrays. The fastest pure-Python approach I have found is a list comprehension, as follows:

def add_arrays(a,b):
    return [m + n for m,n in zip(a,b)]

My Cython approach is a little more complicated and it looks as follows:

from array import array
from libc.stdlib cimport malloc
from cython cimport boundscheck,wraparound

@boundscheck(False)
@wraparound(False)
cpdef add_arrays_Cython(int[:] Aarr, int[:] Barr):
    cdef size_t i, I
    I = Aarr.shape[0]
    cdef int *Carr = <int *> malloc(640000 * sizeof(int))
    for i in range(I):
        Carr[i] = Aarr[i]+Barr[i]
    result_as_array  = array('i',[e for e in Carr[:640000]])
    return result_as_array

Note that I use @boundscheck(False) and @wraparound(False) to make it even faster. Also, I am concerned about a very large array (size 640000) and I found it crashes if I simply use cdef int Carr[640000] so I used malloc(), which solved that problem. Lastly, I return the data structure as a Python array of type integer.

To profile the code I ran the following:

import time
from array import array

a = array('i', range(640000))  # create integer array
b = a[:]  # array to add

T = time.perf_counter()
for i in range(20): add_arrays(a, b)  # Python list comprehension approach
print(time.perf_counter() - T)

> 6.33 seconds

T = time.perf_counter()
for i in range(20): add_arrays_Cython(a, b)  # Cython approach
print(time.perf_counter() - T)

> 4.54 seconds
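
The same kind of measurement can also be taken with the standard-library timeit module, which repeats the call and lets you keep the best run. This is only a sketch and not part of the original post: the helper name best_time is made up for illustration, and the array is smaller than in the question so the example runs quickly.

```python
import timeit
from array import array


def add_arrays(a, b):
    # Pure-Python baseline from the question.
    return [m + n for m, n in zip(a, b)]


def best_time(func, a, b, number=5, repeat=3):
    # Hypothetical helper: run func(a, b) `number` times per trial,
    # repeat the trial `repeat` times, and report the fastest trial
    # as average seconds per call.
    timer = timeit.Timer(lambda: func(a, b))
    return min(timer.repeat(repeat=repeat, number=number)) / number


a = array('i', range(64000))  # smaller than the 640000 used in the question
b = a[:]
seconds = best_time(add_arrays, a, b)
print(f"{seconds:.6f} s per call")
```

Passing add_arrays_Cython instead of add_arrays should time the compiled version in the same way, assuming the extension module has been built and imported.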

Evidently, the Cython-based approach gives a speed-up of about 30%. I expected the speed-up to be closer to an order of magnitude or more (as it is with NumPy).

What can I do to speed up the Cython code further? Are there any obvious bottlenecks in my code? I am a beginner with Cython, so I may be misunderstanding something.

Make sure the description is clear about when you are using a Python `list` versus an `array.array`. `numpy` has, for the most part, replaced the builtin `array` module. I don't know how well `cython` supports it. To maximize speed, look into using the `array`'s buffer interface together with `cython`'s typed memoryviews. — hpaulj, Mar 31 '20 at 16:25

Stefan Dragnev · Accepted Answer · 2020-03-31T12:15:23.103

The biggest bottleneck is the conversion of the result pointer back to an array.

Here's an optimized version:

from cython cimport boundscheck,wraparound
from cython cimport view

@boundscheck(False)
@wraparound(False)
cpdef add_arrays_Cython(int[:] Aarr, int[:] Barr):
    cdef size_t i, I
    I = Aarr.shape[0]
    result_as_array = view.array(shape=(I,), itemsize=sizeof(int), format='i')
    cdef int[:] Carr = result_as_array
    for i in range(I):
        Carr[i] = Aarr[i]+Barr[i]
    return result_as_array

A few things to note here: instead of malloc'ing a temporary buffer and then copying the result to an array, I create a cython.view.array and assign it to an int[:] memoryview. This gives me the raw speed of pointer access and also avoids the unnecessary copying. I also return the Cython object directly, without converting it to a Python object first. In total, this gives me a 70x speed-up compared to your original Cython implementation.

Converting the view object to a list proved tricky: if you simply change the return statement to return list(result_as_array), the code becomes about 10x slower than your initial implementation. But if you add an extra layer of wrapping, like so: return list(memoryview(result_as_array)), the function is about 5x faster than your version. So again, the main overhead is going from the fast native object to a generic Python one, and this should be avoided whenever you need fast code.
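
To illustrate the different routes from a typed buffer to a plain list, here is a small pure-Python sketch. It uses array.array as a stand-in for the Cython view object (an assumption made so that the example runs without compiling anything), so relative timings for cython.view.array may differ and should be measured separately.

```python
from array import array

# Stand-in for the result buffer; the answer above uses cython.view.array.
result = array('i', [10, 20, 30])

via_iteration = list(result)            # generic iteration, one element at a time
via_method = result.tolist()            # array.array's own bulk conversion
via_view = memoryview(result).tolist()  # bulk conversion through the buffer protocol

print(via_iteration, via_method, via_view)
```

All three produce the same list; which one is fastest depends on the type of the source object, so it is worth timing each on the real data.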

For comparison, I ran the code with numpy. The numpy version performed exactly as fast as my Cython version. This suggests that the C compiler was able to automatically vectorize the pairwise summation loop in my code.

Side-note: you need to call free() on malloc()'d pointers, otherwise you leak memory.

If a list should be returned, then one could construct the list using the C-API, which would be faster than creating an intermediate array. — ead, Mar 31 '20 at 12:24
@Stefan Thank you for your helpful and pedagogical answer. Indeed I achieved about 5x speed-up! I was hoping for more but I guess it is a lesson that I should work with the view object. — CodeWanderer, Apr 01 '20 at 11:41
@ead Thanks for the suggestions. I tried int[::1] but it did not change the speed much. Also I tried to convert to array.array using `array.array('i',memoryview(result_as_array))` but that incurred a significant slow-down. — CodeWanderer, Apr 01 '20 at 11:43
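
One option that might be worth timing for the array.array conversion is to copy the raw bytes in one block instead of converting element by element. The sketch below is untested against cython.view.array: it uses a plain array.array as the source, the helper name buffer_to_array is hypothetical, and it assumes the source buffer is C-contiguous and holds C ints (format 'i').

```python
from array import array


def buffer_to_array(buf):
    # Hypothetical helper: reinterpret the int buffer as raw bytes without
    # copying, then let array.array copy those bytes in a single call.
    raw = memoryview(buf).cast('B')
    out = array('i')
    out.frombytes(raw)
    return out


source = array('i', range(5))  # stand-in for result_as_array
copy = buffer_to_array(source)
print(copy)
```

Whether this is actually faster than array.array('i', memoryview(result_as_array)) for the Cython object would need to be measured.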
