
BIG EDIT:

================

For the sake of clarity, I am removing the old results and replacing them with the more recent ones. The question is still the same: am I using both Cython and Numba correctly, and what improvements can be made to the code? (I have a newer, more bare-bones temporary IPython notebook with all the code and results here.)

1)

I think I figured out why there was initially no difference between Cython, Numba, and CPython: it was because I fed them NumPy arrays as input:

x = np.asarray([x_i*np.random.randint(8,12)/10 for x_i in range(n)])

instead of lists:

x = [x_i*random.randint(8,12)/10 for x_i in range(n)]
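
For reference, here is roughly how the inputs and timings were set up (a minimal sketch; n and the y recipe are illustrative, and the *_lstsqr functions are the ones defined below):

import random
import numpy as np

n = 500000

# list input (the y recipe is made up, analogous to x)
x_list = [x_i*random.randint(8, 12)/10 for x_i in range(n)]
y_list = [x_i*random.randint(10, 14)/10 for x_i in range(n)]

# NumPy array input
x_arr, y_arr = np.asarray(x_list), np.asarray(y_list)

%timeit py_lstsqr(x_list, y_list)    # list input
%timeit py_lstsqr(x_arr, y_arr)      # array input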

[Figure: benchmark using NumPy arrays as data input]

[Figure: benchmark using Python lists as input]

2)

I replaced the zip() function with explicit loops; however, it didn't make much of a difference. The code is:

CPython

def py_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    len_x = len(x)
    x_avg = sum(x)/len_x
    y_avg = sum(y)/len(y)
    var_x = 0
    cov_xy = 0
    for i in range(len_x):
        temp = (x[i] - x_avg)
        var_x += temp**2
        cov_xy += temp*(y[i] - y_avg)
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc) 

Cython

%load_ext cythonmagic

%%cython
def cy_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    cdef double x_avg, y_avg, var_x, cov_xy,\
         slope, y_interc, temp
    cdef int len_x, i
    len_x = len(x)
    x_avg = sum(x)/len_x
    y_avg = sum(y)/len(y)
    var_x = 0
    cov_xy = 0
    for i in range(len_x):
        temp = (x[i] - x_avg)
        var_x += temp**2
        cov_xy += temp*(y[i] - y_avg)
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc)

Numba

from numba import jit

@jit
def numba_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    len_x = len(x)
    x_avg = sum(x)/len_x
    y_avg = sum(y)/len(y)
    var_x = 0
    cov_xy = 0
    for i in range(len_x):
        temp = (x[i] - x_avg)
        var_x += temp**2
        cov_xy += temp*(y[i] - y_avg)
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc)
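
Side note (not part of the original benchmark): a quick way to check whether Numba actually compiled the loop or silently fell back to object mode is to inspect the inferred types after one call; alternatively, decorating with @jit(nopython=True) makes the fallback raise an error instead.

import numpy as np

x = np.random.rand(1000)
y = np.random.rand(1000)

numba_lstsqr(x, y)            # trigger compilation with array arguments
numba_lstsqr.inspect_types()  # 'pyobject' entries indicate object-mode fallback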
  • For your first example, I wouldn't expect numba to produce major gains, since you are doing all the computation in numpy anyway. – BrenBarn May 08 '14 at 19:12
  • Thanks, any suggestions how I can implement it in Numba to compare it to CPython? From the Numba documentation I read "Numba is an just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). ", so I thought it would benefit from Numpy code too –  May 08 '14 at 19:15
  • From the examples on the Numba page, I would expect it might speed up code that uses Python-code loops over numpy structures, but your example does nothing except call numpy functions, which are already written in C. I don't know much about Numba, but my guess would be you aren't going to be able to speed up your first example. The lack of speedup on the second example is a bit more surprising, we will see if someone who knows more about Numba replies. – BrenBarn May 08 '14 at 19:19
  • This is why I also implemented it as "classic" approach, where I don't use Numpy at all. I added a figure to show you that Cython is improving the performance significantly by compiling it to C, but Numba doesn't. –  May 08 '14 at 19:20
  • The fastest method may be to use `np.linalg.lstsq` rather than implementing the normal equations. – YXD May 08 '14 at 19:21
  • Fantastic repository, by the way. – YXD May 08 '14 at 19:24
  • Thanks, but np.linalg.lstsq is in fact slower! The fastest approach is implementing the "classic" one in Cython. I have done the benchmark Cython vs. numpy (np.linalg.lstsq) vs. scipy (scipy.stats.linregress) [here](http://sebastianraschka.com/IPython_htmls/cython_least_squares.html#showdown) –  May 08 '14 at 19:28
  • @BrenBarn No, that's only half of the truth ;) I am using Numpy (on purpose) only for one comparison: numba_mat_lstsqr(x, y) vs. py_mat_lstsqr(x, y). And the second comparison numba_lstsqr(x, y) vs py_lstsqr(x, y) is completely without numpy –  May 08 '14 at 20:07
  • @SebastianRaschka: Yes, maybe I was unclear. By "your first example" I meant the comparison between `py_mat_lstsqr` and `numba_mat_lstsqr` (which doesn't surprise me). But what I call "your second example" is the comparison between `numba_lstsqr` and `py_lstsqr` (which does surprise me). Googling around, I see a few cases where someone said Numba wasn't able to infer the types in some functions so there was no speedup, but I don't know enough about Numba to know if that's what's happening here, or how to improve it. – BrenBarn May 08 '14 at 20:10
  • @BrenBarn Thanks for the clarification! Yes, this is surprising me too, and I am just wondering if there is something to be done about it. When I use Cython for example, I have a ~30x improvements when I don't change the code at all, and a ~80x improvement when I add static type declarations –  May 08 '14 at 20:23
  • @SebastianRaschka, Cython *is not giving you a speed boost without type declarations*. Your `%pylab inline` call is overwriting the `sum` builtin, which messes up your timings. I've told you this. [I wrote this bunch of code to *prove* it to you.](https://github.com/Veedrac/Time-Least-Squares) So why are you still saying "I have a ~30x improvements when I don't change the code at all"?... – Veedrac May 09 '14 at 14:01
  • Is this about CPython or Cython? Maybe changing the title of the question should be considered. – Jonas Schäfer May 09 '14 at 14:16
  • @Veedrac Yes, but this shouldn't be a problem, since the timing is done BEFORE I load %pylab –  May 09 '14 at 15:02
  • @JonasWielicki actually it is Numba vs. CPython, and just for comparison I added Cython vs. CPython –  May 09 '14 at 15:03
  • @Veedrac I will go ahead and replace those %pylab inlines by %matplotlib inlines, which I believe aren't doing this. Also I will make sure to restart the kernel after each timing/plotting step. This will take a while, also I will remove the zip() to see how Numba will do then... thanks again –  May 09 '14 at 18:05
  • @SebastianRaschka I don't see how you can say that when it is literally the third block in your iPython notebook. :/ – Veedrac May 09 '14 at 18:16
  • I updated the original post - the problem was caused by numpy arrays that I passed as input ... but I am still not happy with the performance, or is this the most I can get for this code? –  May 09 '14 at 21:15
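
To make Veedrac's point above concrete, here is a minimal sketch of the shadowing issue (illustrative, not from the notebook): %pylab effectively runs from numpy import *, so the name sum is rebound to numpy.sum, which is much slower on plain Python lists because it converts them to arrays first.

import numpy as np

data = list(range(1000000))

%timeit sum(data)     # builtin sum on a list
%timeit np.sum(data)  # what 'sum' resolves to after %pylab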

2 Answers


Here's what I think is happening with Numba:

Numba works on Numpy arrays. Nothing else. Everything else has nothing to do with Numba.

zip returns an iterator of arbitrary items, which Numba cannot see into. Thus Numba cannot do much compiling.

Looping over the indexes with a for i in range(...) is likely to produce a much better result and allow much stronger type inference.
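
A minimal sketch of that contrast (the function names are mine, not from the question):

import numpy as np
from numba import jit

@jit
def dot_zip(x, y):
    # zip() yields generic item pairs; 2014-era Numba could not infer
    # their types and fell back to slow object mode here
    total = 0.0
    for x_i, y_i in zip(x, y):
        total += x_i * y_i
    return total

@jit
def dot_range(x, y):
    # index-based access on typed NumPy arrays lets Numba infer doubles
    # and compile the loop to native code
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total

x = np.random.rand(1000)
y = np.random.rand(1000)
assert np.isclose(dot_zip(x, y), dot_range(x, y))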

– Veedrac
  • Thanks! I will rewrite it and redo the benchmark this weekend. I will let you know about the results! –  May 09 '14 at 18:03
  • the zip didn't do that much ... I replaced it now, but the real problem was the numpy array that I passed as input. –  May 09 '14 at 21:16
  • As I understand, it's not so much that Numba can't see into anything outside of Numpy, but that the way it handles such things is no improvement. Check out [Numba - Tell Those C++ Bullies to Get Lost](https://www.youtube.com/watch?v=1AwG0T4gaO0) for a great explanation of a lot of Numba. – Post169 Jun 13 '18 at 16:01

Using the built-in sum() could be causing problems: it is the kind of call Numba's type inference can fail on, which forces a fall-back to object mode.

Here's linear regression code that will run faster in Numba:

import numba

@numba.jit
def ols(x, y):
    """Simple OLS for two data sets."""
    M = x.size

    x_sum = 0.
    y_sum = 0.
    x_sq_sum = 0.
    x_y_sum = 0.

    for i in range(M):
        x_sum += x[i]
        y_sum += y[i]
        x_sq_sum += x[i] ** 2
        x_y_sum += x[i] * y[i]

    slope = (M * x_y_sum - x_sum * y_sum) / (M * x_sq_sum - x_sum**2)
    intercept = (y_sum - slope * x_sum) / M

    return slope, intercept
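
A usage sketch (the test data is made up; array inputs, since that is where Numba's type inference works best):

import numpy as np

x = np.arange(1000, dtype=np.float64)
y = 2.0 * x + 1.0 + np.random.randn(1000)  # roughly y = 2x + 1

slope, intercept = ols(x, y)
print(slope, intercept)  # should come out close to 2.0 and 1.0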
– Turtles Are Cute