
I am trying to accelerate a binary erosion image-processing function with Cython, but I am new to Cython and am not seeing the drastic speedups I was expecting. I am looking for help optimizing this code, as I am still not familiar with how C types, indexing, memory views, and Python objects affect performance. Below are the Cython function, an equivalent Python function using SciPy, setup.py, and the Jupyter notebook with timing output.

Cython code erode.pyx

import numpy as np
cimport numpy as np

DTYPE = np.int_
ctypedef np.int_t DTYPE_t

def erode(long [:,:] img):

    # Variables
    cdef int height, width, local_min
    cdef int vals[5]
    height = img.shape[0]
    width = img.shape[1]

    # Padded Array
    cdef np.ndarray[DTYPE_t, ndim=2] padded = np.zeros((height+2, width+2), dtype = DTYPE)
    padded[1:height+1,1:width+1] = img

    #Return array
    cdef np.ndarray[DTYPE_t, ndim=2] eroded = np.zeros((height,width),dtype=DTYPE)

    cdef int i,j
    for i in range(height):
        for j in range(width):
            vals = [padded[i+1,j+1], padded[i,j+1], padded[i+1,j],padded[i+1,j+2],padded[i+2,j+1]]
            local_min = min(vals)
            eroded[i,j] = local_min
    return eroded
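
For sanity-checking the Cython output, the same plus-shaped (4-neighbour) minimum filter can be written with plain NumPy slicing on a zero-padded array. This is only a reference implementation for comparing results, not an optimization, and `erode_ref` is a name chosen here for illustration:

```python
import numpy as np

def erode_ref(img):
    # Zero-pad, then take the element-wise minimum of the centre pixel
    # and its 4 neighbours (same neighbourhood as the Cython loop).
    h, w = img.shape
    p = np.zeros((h + 2, w + 2), dtype=img.dtype)
    p[1:h+1, 1:w+1] = img
    return np.minimum.reduce([p[1:h+1, 1:w+1],   # centre
                              p[0:h,   1:w+1],   # up
                              p[2:h+2, 1:w+1],   # down
                              p[1:h+1, 0:w],     # left
                              p[1:h+1, 2:w+2]])  # right
```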

Python code erode_py.py

import numpy as np
from scipy.ndimage import binary_erosion


def erode_py(img):

    strel = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]], dtype=np.uint8)
    img = img.astype(np.uint8)
    eroded_image = binary_erosion(img, strel, border_value=0)
    return eroded_image

setup.py

from distutils.core import setup
from Cython.Build import cythonize
import numpy


setup(
    name='binary_erode_build',
    ext_modules=cythonize("erode.pyx"),
    include_dirs=[numpy.get_include()]
)
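
With this setup.py, the extension is built in place with the usual Cython build command:

```shell
python setup.py build_ext --inplace
```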

Jupyter notebook

import numpy as np
import erode
import erode_py

obj = np.array([[0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1, 1, 0],
       [0, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0],
       [0, 1, 1, 1, 1, 1, 0, 0]], dtype=np.int_)

%timeit -n100 -r100 erode.erode(obj)
%timeit -n100 -r100 erode_py.erode_py(obj)

42.8 µs ± 10.3 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
44.2 µs ± 14.4 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
Alex Magsam
  • What have you tried? What functions have been identified as taking too long and need further enhancements? It looks like time complexity is `O(height*width)` from the nested loops. Since all of that is already using numpy and compiled code, there is not much more to be done with that algorithm. The loops could be run in parallel though to use more than one core. – danny Apr 11 '18 at 16:00
  • It also looks like you are calling a python function `min` inside of your double for loop. You should probably try to replace that with a `cdef` function equivalent. – CodeSurgeon Apr 11 '18 at 16:04
  • Run `cython -a erode.pyx`, and view `erode.html` in your browser. The various shades of yellow highlighting warn you about code generated by cython that is calling back into the Python API (i.e. it is not "pure C"). In your case, the first two lines of the three in the body of your inner loop are dark yellow. – Warren Weckesser Apr 11 '18 at 19:50
  • For the line that indexes into `padded`, use a single "flat" index rather than 2D indices. That's faster. Agreed with above that replacing min with a cdef function will also help a lot. I'd also not use numpy arrays inside your function, just use typed memoryviews everywhere. That is, padded could be defined like: `cdef DTYPE_t[:, :] padded = np.zeros((height+2, width+2), dtype = DTYPE)`. Also you almost certainly want to be using np.intp_t instead of np.int_t. Finally you should also disable bounds checking and wraparound for your function via a compiler directive decorator. – ngoldbaum Apr 11 '18 at 20:10
  • @ngoldbaum 2D indices really aren't the problem - using a flat index will just replicate the calculation that Cython does. The issue on that line is building a list of Python objects. – DavidW Apr 11 '18 at 20:22
  • Ah yes, avoiding creating the list will help. Just populate your C array with a for loop. – ngoldbaum Apr 11 '18 at 20:23
  • @DavidW It looks like the python list was a big issue. The code is now 4x faster with that edit. – Alex Magsam Apr 11 '18 at 22:46
  • @ngoldbaum why use np.intp_t instead? I am pretty new to C types so if you have any references to help that would be great also. – Alex Magsam Apr 11 '18 at 22:46
  • `intp_t` is an alias to `ssize_t`, which is the type for the sizes of things. Whether it's a 32 or 64 bit int is platform dependent. `int_t` is always the C `long` type. Your code would be more cross-platform compatible if you use `intp_t`. – ngoldbaum Apr 11 '18 at 23:09
  • Also you really do want to take a look at the per-function compiler directive decorators I mentioned. You will almost certainly see speedups if you turn off wraparound and bounds checking. – ngoldbaum Apr 11 '18 at 23:10
  • I think the suggestion to use `intp_t` is pretty bad advice - you've simply swapped one arbitrary platform-dependent type for another. Use `intp_t` only for its intended purpose: when you need an integer just big enough to store a pointer in. I'd suggest using a fixed-width integer type instead: `int8_t`, `int16_t`, `int32_t` or `int64_t`. Since all the numbers here look to be 0 or 1, `int8_t` might be best because it's smallest. – DavidW Apr 12 '18 at 19:20
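
Putting the comments' suggestions together (typed memoryviews everywhere, a `cdef` minimum instead of the Python `min` on a list, and the bounds-checking/wraparound directives), the hot loop could be restructured roughly as in this untested sketch:

```cython
# cython: language_level=3
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.int_
ctypedef np.int_t DTYPE_t

cdef inline DTYPE_t min5(DTYPE_t a, DTYPE_t b, DTYPE_t c,
                         DTYPE_t d, DTYPE_t e) nogil:
    # Plain C comparisons instead of the Python builtin min() on a list.
    cdef DTYPE_t m = a
    if b < m: m = b
    if c < m: m = c
    if d < m: m = d
    if e < m: m = e
    return m

@cython.boundscheck(False)
@cython.wraparound(False)
def erode(DTYPE_t[:, :] img):
    cdef Py_ssize_t height = img.shape[0]
    cdef Py_ssize_t width = img.shape[1]
    cdef Py_ssize_t i, j

    # Typed memoryviews throughout, as suggested in the comments.
    cdef DTYPE_t[:, :] padded = np.zeros((height + 2, width + 2), dtype=DTYPE)
    padded[1:height+1, 1:width+1] = img
    cdef DTYPE_t[:, :] eroded = np.zeros((height, width), dtype=DTYPE)

    for i in range(height):
        for j in range(width):
            eroded[i, j] = min5(padded[i+1, j+1], padded[i, j+1],
                                padded[i+1, j], padded[i+1, j+2],
                                padded[i+2, j+1])
    return np.asarray(eroded)
```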

0 Answers