Consider the following minimal example:
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True
cimport cython
from libc.stdlib cimport malloc
def main(size_t ni, size_t nt, size_t nx):
    cdef:
        size_t i, j, t, x, y
        double[:, :, ::1] a = <double[:ni, :ni, :nx]>malloc(ni * ni * nx * sizeof(double))
        double[:, :, ::1] b = <double[:nt, :ni, :nx]>malloc(nt * ni * nx * sizeof(double))
        size_t[:, :, ::1] best = <size_t[:nt, :ni, :nx]>malloc(nt * ni * nx * sizeof(size_t))
        size_t mxi
        double s, mxs
    for t in range(nt):
        for j in range(ni):
            for y in range(nx):  # this loop does nothing but is needed for the effect below.
                mxs = -1e300
                for i in range(ni):
                    for x in range(nx):
                        with cython.boundscheck(True):  # Faster!?!?
                            s = b[t, i, x] + a[i, j, x]
                            if s >= mxs:
                                mxs = s
                                mxi = i
                best[t + 1, j, y] = mxi
    return best[0, 0, 0]
Essentially, this sums two 2D slices of the arrays along specific axes and finds the maximizing index along another axis.
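In rough NumPy terms, the loops compute something like the sketch below (my own reference code, not the code being timed; it ignores the >= tie-breaking order and writes to best[t] rather than best[t + 1]):

import numpy as np

def reference(a, b):
    # a has shape (ni, ni, nx), b has shape (nt, ni, nx)
    nt, ni, nx = b.shape
    best = np.empty((nt, ni, nx), dtype=np.intp)
    for t in range(nt):
        for j in range(ni):
            s = b[t] + a[:, j, :]  # shape (ni, nx)
            # index i of the maximum over all (i, x) pairs
            i_max = np.unravel_index(np.argmax(s), s.shape)[0]
            best[t, j, :] = i_max  # same value for every y
    return best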
When compiled with gcc -O3 and called with the arguments (1, 2000, 2000), the version with boundscheck=True runs about twice as fast as the one with boundscheck=False.
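For reference, the build and timing can be reproduced along these lines (file and module names are illustrative; the relevant part is passing -O3 to gcc):

# setup.py (illustrative)
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

setup(ext_modules=cythonize(
    [Extension("bench", ["bench.pyx"], extra_compile_args=["-O3"])],
))

# timing (illustrative): ni=1, nt=2000, nx=2000
import timeit
import bench
print(timeit.timeit(lambda: bench.main(1, 2000, 2000), number=10))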
Any hint as to why this is the case? (Well, I can guess this again has to do with GCC autovectorization...)
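One way to check that guess would be to have gcc report its vectorization decisions, e.g. by adding the -fopt-info-vec flags to the illustrative extension above (the exact report format depends on the gcc version):

# illustrative: make gcc print which loops were / were not vectorized
Extension("bench", ["bench.pyx"],
          extra_compile_args=["-O3", "-fopt-info-vec-optimized", "-fopt-info-vec-missed"])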