Cython call optimization

Question

I've got a Python function I try to export to Cython. I have tested two implementations but I don't understand why the second one is slower than the first one. Furthermore, I am looking for ways to improve speed a little more but I have no clue how ?

Base code

import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.int
ctypedef np.int_t DTYPE_t
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
cdef extern from "math.h":
    double exp(double x)

@cython.boundscheck(False)
@cython.wraparound(False)
def bilateral_filter_C(np.ndarray[np.float_t, ndim=1] samples, int w=20):
    # Filter Parameters
    cdef Py_ssize_t size = samples.shape[0]
    cdef float rang
    cdef float sigma = 2*3.0*3.0
    cdef int j, L
    cdef unsigned int a, b
    cdef np.float_t W, num, sub_sample, intensity

    # Initialization
    cdef np.ndarray[np.float_t, ndim=1] gauss = np.zeros(2*w+1, dtype=np.float)
    cdef np.ndarray[np.float_t, ndim=1] sub_samples, intensities = np.empty(size, dtype=np.float)
    cdef np.ndarray[np.float_t, ndim=1] samples_filtered = np.empty(size, dtype=np.float)

    L = 2*w+1
    for j in xrange(L):
        rang = -w+1.0/L
        rang *= rang
        gauss[j] = exp(-rang/sigma)


    <CODE TO IMPROVE>

    return samples_filtered

I tried to inject those two code samples in the <CODE TO IMPROVE> section:

Most efficient approach

    for i in xrange(size):            
        a = <unsigned int>int_max(i-w, 0)
        b = <unsigned int>int_min(i+w, size-1)
        L = b-a        

        sub_samples = samples[a:b]-samples[i]
        sub_samples *= sub_samples
        for j in xrange(L):
            sub_samples[j] = exp(-sub_samples[j]/sigma)
        intensities = gauss[w-i+a:w-i+b]*sub_samples

        num = 0.0
        W = 0.0        
        for j in xrange(L):
            W += intensities[j]
            num += intensities[j]*samples[a+j]

        samples_filtered[i] = num/W

Result

%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 45 ms per loop

Less efficient

    for i in xrange(size):            
        a = <unsigned int>int_max(i-w, 0)
        b = <unsigned int>int_min(i+w, size-1)

        num = 0.0
        W = 0.0        
        for j in xrange(b-a):
            sub_sample = samples[a+j]-samples[i]
            intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma)
            W += intensity
            num += intensity*samples[a+j]

        samples_filtered[i] = num/W

Result

%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 125 ms per loop

You need to highlight and explain the difference. You have two long code blocks that look the same - until I scrolled down. If it takes too much time to read and understand your question, posters will move on to easier questions. — hpaulj, Jan 19 '17 at 18:07
Post updated accordingly. The code of interest has been emphasized — GuillaumeA, Jan 19 '17 at 18:43

sebacastroh · Accepted Answer · 2017-01-20T15:52:16.897

You have a few typos:

1) You forgot to define i, just add cdef int i, j, L

2) In the second algorithm you wrote intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma), it should be intensity, without the 1

3) I would add @cython.cdivision(True) to avoid the check of division by zero

With those changes and with x = np.random.rand(10000)I got the following results

%timeit bilateral_filter_C1(x, 20) # First code
10 loops, best of 3: 74.1 ms per loop
%timeit bilateral_filter_C2(x, 20) # Second code
100 loops, best of 3: 9.5 ms per loop

And, to check the results

np.all(np.equal(bilateral_filter_C1(x, 20), bilateral_filter_C2(x, 20)))
True

To avoid these problems I suggest to use the option cython my_file.pyx -a, it generates an html file that shows you the possible problems you have in your code

EDIT

Reading again the code, it seems to have more errors:

for j in xrange(L):
    rang = -w+1.0/L
    rang *= rang
    gauss[j] = exp(-rang/sigma)

gauss has the same value always, what is the definition of rang?

Thank you !! I missed the definition of the variable `i`. That was the main issue to performance. And thanks for the gauss vector generation, it is also a copy/paste from an old code that had to be fixed.... — GuillaumeA, Jan 20 '17 at 19:35

Cython call optimization

1 Answers1