
Consider the following Cython code:

cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

def test_numpyvec(a, b):
    a += b

def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b
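The benchmark session looks roughly like this (a minimal sketch; the module name cytest is illustrative):

In [1]: from cytest import test_memoryview, test_numpy, test_numpyvec, gendata
In [2]: a, b = gendata()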

Running it in the interpreter yields (after a few runs to warm up the cache):

In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop

In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop

In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop

# See the answer below:
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop

I tried it with different dataset sizes, and the vectorized NumPy function consistently ran faster than the compiled Cython code, while I was expecting Cython to be on par with vectorized NumPy in terms of performance.

Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) in order to make such simple operations run faster? Can I improve the performance of this code?

Update: The raw pointer version seems to be on par with NumPy, so apparently there is some overhead in using memoryview or NumPy indexing.

– F.X.
  • 10 loops: Are you really running the performance tests only 10 times to get an average? If so, the normal variance might be bigger than what you are trying to measure. Try 100000 times instead. – Aaron Digulla Jun 19 '14 at 13:53
  • Is this Python 2.x? If so, `range` might account for some of the difference – YXD Jun 19 '14 at 13:58
  • @AaronDigulla: I updated the question with the timings for 100 runs – F.X. Jun 19 '14 at 14:05
  • @MrE: I was under the impression that Cython automatically converted the use of `range` to C loops. Was I wrong? – F.X. Jun 19 '14 at 14:06
  • @F.X. I think you are right about the `range()`. I checked the generated `C` code and it is transformed directly to a `C` for loop – Saullo G. P. Castro Jun 19 '14 at 14:07
  • Depending on your hardware and numpy version, some basic math operations may be using [SSE2](http://en.wikipedia.org/wiki/SSE2) instructions, and thus run twice as fast with `double`, or four times as fast with `float`, as a vanilla C/Cython implementation. – Jaime Jun 19 '14 at 16:47
  • @Jaime: Is that supported directly by NumPy, does GCC do that on its own when compiling, or is that provided by BLAS/MKL/Atlas? – F.X. Jun 20 '14 at 10:11
  • It is provided directly by numpy if your hardware supports it; it has nothing to do with the libraries installed. And I don't think compilers can do SSE on their own just yet, but am not really sure. – Jaime Jun 20 '14 at 13:40
  • Hmm. Using SSE2 would cause a performance gap like this, but I wouldn't expect the version using raw pointers to perform as well as numpy if that was what made the difference. – IanH Jun 21 '14 at 16:46
  • It would be useful to also have the definition of test_raw_pointers, since it seems to be the fastest solution when we can't use the numpy version. [ok, see the answer below...] Moreover, it would be interesting to have the same test with a slightly more complicated operation, for example a + b^2 + 2ba. In such cases we cannot use a pure numpy function (a += b**2 + 2*b*a would be much slower). So what would be the fastest solution? – paugier Jul 01 '14 at 07:12

3 Answers


Another option is to use raw pointers (and module-level directives, to avoid repeating the @cython decorators):

#cython: wraparound=False
#cython: boundscheck=False
#cython: nonecheck=False

cimport numpy as np
# ...

# C-level loop operating on raw pointers
cdef ctest_raw_pointers(int n, double *a, double *b):
    cdef int i
    for i in range(n):
        a[i] += b[i]

# Python-visible wrapper: pass the length and the address of each first element
def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])
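Note that taking &a[0] assumes the arrays are contiguous in memory. A minimal sketch that makes Cython enforce this up front, using the buffer mode flag (the _safe wrapper name is illustrative):

def test_raw_pointers_safe(np.ndarray[double, ndim=1, mode='c'] a,
                           np.ndarray[double, ndim=1, mode='c'] b):
    # mode='c' rejects non-C-contiguous inputs at call time, so the
    # pointer arithmetic inside ctest_raw_pointers stays valid
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])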
– Saullo G. P. Castro
  • Good idea, I'll update the question with the timings from this function! – F.X. Jun 19 '14 at 14:09
  • See my update. Apparently raw pointers seem to be on par with the vectorized NumPy version. I'm going to investigate this further, and if there's no better option I'll accept your answer. – F.X. Jun 19 '14 at 15:09
  • I actually didn't, so I'm going to accept your answer, thanks for reminding me! – F.X. Apr 10 '15 at 06:17

On my machine the difference isn't as large, but I can nearly eliminate it by changing the NumPy and memoryview functions like this:

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double] a, np.ndarray[double] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

Then, when compiling the C output from Cython, I use the flags -O3 and -march=native. This suggests that the difference in timings comes from the use of different compiler optimizations.

I use the 64 bit version of MinGW and NumPy 1.8.1. Your results will probably vary depending on your package versions, hardware, platform, and compiler.

If you are using the IPython notebook's Cython magic, you can force a rebuild with the additional compiler flags by replacing %%cython with %%cython -f -c=-O3 -c=-march=native.

If you are using a standard setup.py for your Cython module, you can specify the extra_compile_args argument when creating the Extension object that you pass to distutils.setup.
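A minimal sketch of such a setup.py, assuming a GCC-compatible compiler (module and file names are illustrative):

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "cytest",
    sources=["cytest.pyx"],
    include_dirs=[numpy.get_include()],           # needed for 'cimport numpy'
    extra_compile_args=["-O3", "-march=native"],  # the flags discussed above
)

setup(ext_modules=cythonize([ext]))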

Note: I removed the ndim=1 flag when specifying the types for the NumPy arrays because it isn't necessary; that value defaults to 1 anyway.

– IanH
  • I'm using a setup.py file because I didn't know about the IPython magic, which is pretty nice btw! IIRC distutils default to -O2 when compiling extensions, maybe that's what's happening here. I'll look into it on Monday! – F.X. Jun 21 '14 at 16:53

A change that slightly increases the speed is to declare the memoryviews as contiguous (the ::1 specifies unit stride):

def test_memoryview_inorder(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]
– Veedrac
  • I have a two-dimensional array and tried to specify `double[::1, ::1] b`, but was told "Cannot specify an array that is both C and Fortran contiguous.". Writing just `double[:, ::1] b` compiles. Is there a way to use your answers for two dimensions? – Thomas Ahle Jan 14 '21 at 17:22
  • @ThomasAhle https://docs.cython.org/en/latest/src/userguide/memoryviews.html#view-general-layouts, I think `double[:, ::1]` (or `double[::1, :]`) should be fine; see the sketch below. – Veedrac Jan 14 '21 at 22:07
  • This gave a factor 2 improvement in my code. Can you refer me to somewhere I can learn about what's going on? – Thomas Ahle Jan 14 '21 at 23:10
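For reference, a minimal sketch of the 2-D case discussed in the comments above, assuming both arrays are C-contiguous (the function name is illustrative):

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview_2d(double[:, ::1] a, double[:, ::1] b):
    cdef Py_ssize_t i, j
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            a[i, j] += b[i, j]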