Consider the following minimal example:
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True
cimport cython
from libc.stdlib cimport malloc
def main(size_t ni, size_t nt, size_t nx):
    cdef:
        size_t i, j, t, x, y
        double[:, :, ::1] a = <double[:ni, :ni, :nx]>malloc(ni * ni * nx * sizeof(double))
        double[:, :, ::1] b = <double[:nt, :ni, :nx]>malloc(nt * ni * nx * sizeof(double))
        size_t[:, :, ::1] best = <size_t[:nt, :ni, :nx]>malloc(nt * ni * nx * sizeof(size_t))
        size_t mxi
        double s, mxs
    for t in range(nt):
        for j in range(ni):
            for y in range(nx):  # this loop does nothing but is needed for the effect below.
                mxs = -1e300
                for i in range(ni):
                    for x in range(nx):
                        with cython.boundscheck(True):  # Faster!?!?
                            s = b[t, i, x] + a[i, j, x]
                            if s >= mxs:
                                mxs = s
                                mxi = i
                best[t + 1, j, y] = mxi
    return best[0, 0, 0]
Essentially, this sums two 2D slices of the arrays along specific axes and finds the maximizing index along another axis.
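In rough NumPy terms, the loops compute something like the sketch below (my own reference code, not the code being timed; it ignores the >= tie-breaking order and writes to best[t] rather than best[t + 1]):

import numpy as np

def reference(a, b):
    # a has shape (ni, ni, nx), b has shape (nt, ni, nx)
    nt, ni, nx = b.shape
    best = np.empty((nt, ni, nx), dtype=np.intp)
    for t in range(nt):
        for j in range(ni):
            s = b[t] + a[:, j, :]  # shape (ni, nx)
            # index i of the maximum over all (i, x) pairs
            i_max = np.unravel_index(np.argmax(s), s.shape)[0]
            best[t, j, :] = i_max  # same value for every y
    return best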
When compiled with gcc -O3 and called with the arguments (1, 2000, 2000), the version with boundscheck=True runs about twice as fast as the one with boundscheck=False.
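For reference, the build and timing can be reproduced along these lines (file and module names are illustrative; the relevant part is passing -O3 to gcc):

# setup.py (illustrative)
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

setup(ext_modules=cythonize(
    [Extension("bench", ["bench.pyx"], extra_compile_args=["-O3"])],
))

# timing (illustrative): ni=1, nt=2000, nx=2000
import timeit
import bench
print(timeit.timeit(lambda: bench.main(1, 2000, 2000), number=10))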
Any hint as to why this is the case? (Well, I can guess this again has to do with GCC autovectorization...)
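One way to check that guess would be to have gcc report its vectorization decisions, e.g. by adding the -fopt-info-vec flags to the illustrative extension above (the exact report format depends on the gcc version):

# illustrative: make gcc print which loops were / were not vectorized
Extension("bench", ["bench.pyx"],
          extra_compile_args=["-O3", "-fopt-info-vec-optimized", "-fopt-info-vec-missed"])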