Consider the following Cython code:
cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

def test_numpyvec(a, b):
    a += b

def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b
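For reference, the extension has to be compiled before it can be timed. A minimal build script along the following lines should work; the file name cython_test.pyx and the module name are assumptions here, not something given in the question:

# setup.py -- minimal build sketch; the file/module name cython_test is an assumption
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "cython_test",
        ["cython_test.pyx"],
        # NumPy headers are needed because of "cimport numpy as np"
        include_dirs=[np.get_include()],
    )
]

setup(ext_modules=cythonize(extensions))

Building with python setup.py build_ext --inplace produces an importable extension module whose functions can then be timed in IPython.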
Running it in the interpreter yields (after a few runs to warm up the cache):
In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop
In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop
In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop
# See answer below:
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop
I tried it with different dataset sizes, and the vectorized NumPy function consistently ran faster than the compiled Cython code, whereas I expected Cython to be on par with vectorized NumPy in terms of performance.
Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) in order to make such simple operations run faster? Can I improve the performance of this code?
Update: The raw pointer version seems to be on par with NumPy. So apparently there's some overhead in using memoryview or NumPy indexing.
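The raw pointer version referred to above is not reproduced in the question. A sketch along these lines, assuming the same cimports as the module above, captures the idea; test_raw_pointers here is a reconstruction, not the answer's exact code:

@cython.boundscheck(False)
@cython.wraparound(False)
def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    # Sketch: take raw C pointers to the underlying buffers once and index
    # through them directly, bypassing the per-element buffer indexing machinery.
    cdef int i
    cdef int n = a.shape[0]
    cdef double* ap = &a[0]
    cdef double* bp = &b[0]
    for i in range(n):
        ap[i] += bp[i]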