Speed-up cython code

Question

I have code that is working in python and want to use cython to speed up the calculation. The function that I've copied is in a .pyx file and gets called from my python code. V, C, train, I_k are 2-d numpy arrays and lambda_u, user, hidden are ints. I don't have any experience in using C or cython. What is an efficient way to make this code faster. Using cython -a for compiling shows me that the code is flawed but how can I improve it. Using for i in prange (user_size, nogil=True): results in Constructing Python slice object not allowed without gil.

How has the code to be modified to harvest the power of cython?

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def u_update(V, C, train, I_k, lambda_u, user, hidden):
    cdef int user_size = user
    cdef int hidden_dim = hidden
    cdef np.ndarray U = np.empty((hidden_dim,user_size), float)
    cdef int m = C.shape[1]

    for i in range(user_size):
        C_i = np.zeros((m, m), dtype=float)
        for j in range(m):
            C_i[j,j]=C[i,j]

        U[:,i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i,V.T)) + lambda_u*I_k), np.dot(V, np.dot(C_i,train[i,:].T)))

    return U

To my opinion, not sure you can improve execution speed since you mainly use `numpy` in this code which is already compiled/optimized. Have you profiled your code? — Laurent LAPORTE, Sep 28 '16 at 11:23
That `U` expression is too complex for `cython` to speed up. The `C_i` can be done with a `diagonal` function. — hpaulj, Sep 28 '16 at 11:28
`user_size` is around 1000. No, I haven't profiled the code yet. I was first using the `np.diag` function to compute `C_i` but had hoped it would be faster with parallelized loop. So to speed it up I would have to rewrite the calculation of `U`? — саша, Sep 28 '16 at 11:36
This can be speedup by generalized ufunc `numpy.linalg.solve`. `dot(inv(A), B)` can be calculated by `solve(A, B)`. and dot with diagonal matrix can be replaced by multiple broadcast. If you give some test data, I can show you how to do this. — HYRY, Sep 28 '16 at 12:13
**Never, ever** use `inv` in a production code. Use either `solve` or `lstsq` — percusse, Sep 28 '16 at 19:41

hpaulj · Accepted Answer · 2016-09-28T19:30:41.610

You are trying to use cython by diving into the deep end of pool. You should start with something small, such as some of the numpy examples. Or even try to improve on np.diag.

    i = 0
    C_i = np.zeros((m, m), dtype=float)
    for j in range(m):
        C_i[j,j]=C[i,j]

v.

    C_i = diag(C[i,:])

Can you improve the speed of this simple expression? diag is not compiled, but it does perform an efficient indexed assignment.

 res[:n-k].flat[i::n+1] = v

But the real problem for cython is this expression:

U[:,i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i,V.T)) + lambda_u*I_k), np.dot(V, np.dot(C_i,train[i,:].T)))

np.dot is compiled. cython won't turn that in to c code, nor will it consolidate all 5 dots into one expression. It also won't touch the inv. So at best cython will speed up the iteration wrapper, but it will still call this Python expression m times.

My guess is that this expression can be cleaned up. Replacing the inner dots with einsum can probably eliminate the need for C_i. The inv might make 'vectorizing' the whole thing difficult. But I'd have to study it more.

But if you want to stick with the cython route, you need to transform that U expression into simple iterative code, without calls to numpy functions like dot and inv.

===================

I believe the following are equivalent:

np.dot(C_i,V.T)
C[i,:,None]*V.T

In:

np.dot(C_i,train[i,:].T)

if train is 2d, then train[i,:] is 1d, and the .T does nothing.

In [289]: np.dot(np.diag([1,2,3]),np.arange(3))
Out[289]: array([0, 2, 6])
In [290]: np.array([1,2,3])*np.arange(3)
Out[290]: array([0, 2, 6])

If I got that right, you don't need C_i.

======================

Furthermore, these calculations can be moved outside the loop, with expressions like (not tested)

CV1 = C[:,:,None]*V.T   # a 3d array
CV2 = C * train.T  

for i in range(user_size):
    U[:,i] = np.dot(np.linalg.inv(np.dot(V, CV1[i,...]) + lambda_u*I_k), np.dot(V, CV2[i,...]))

A further step is to move both np.dot(V,CV...) out of the loop. That may require np.matmul (@) or np.einsum. Then we will have

for i...
    I = np.linalg.inv(VCV1[i,...])  
    U[:,i] = np.dot(I+ lambda_u), VCV2[i,])

or even

for i...
     I[...i] = np.linalg.inv(...) # if inv can't be vectorized
U = np.einsum(..., I+lambda_u, VCV2)

This is a rough sketch, and details will need to be worked out.

Yes, you are right. I don't need to calculate `C_i`. This way the program is already 3-4 times faster. Moving the calculation (`CV1` and `CV2=C*train`) outside hasn't yet resulted in an improvement. Your last proposals I haven't tested yet. — саша, Sep 28 '16 at 22:45

score 2 · Answer 2 · answered Sep 28 '16 at 12:09

The first thing that comes to mind is you haven't typed the function arguments and specified the data type and number of dimensions like so :

def u_update(np.ndarray[np.float64, ndim=2]V, np.ndarray[np.float64, ndim=2]\
C, np.ndarray[np.float64, ndim=2] train, np.ndarray[np.float64, ndim=2] \
I_k, int lambda_u, int user, int hidden) :

This will greatly speed up indexing with 2 indices like you do in the inner loop.

It's best to do this to the array U as well, although you are using slicing:

cdef np.ndarray[np.float64, ndim=2] U = np.empty((hidden_dim,user_size), np.float64)

Next, you are redefining C_i, a large 2-D array every iteration of the outer loop. Also, you have not supplied any type information for it, which is a must if Cython is to offer any speedup. To fix this :

cdef np.ndarray[np.float64, ndim=2] C_i = np.zeros((m, m), dtype=np.float64)
    for i in range(user_size):
        C_i.fill(0)

Here, we have defined it once (with type information), and reused the memory by filling with zeros instead of calling np.zeros() to make a new array every time.

Also, you might want to turn off bounds checking only after you have finished debugging.

If you need speedups in the U[:,i]=... step, you could consider writing another function with Cython to perform those operations using loops.

Do read this tutorial which should give you an idea of what to do when working with Numpy arrays in Cython and what not to do as well, and also to appreciate how much of a speedup you can get with these simple changes.

Your advice regarding the data types is certainly correct but in my case there was no noticeable gain because most of the time was spent for the calculation of `U[:,i]`. — саша, Sep 28 '16 at 22:51
That's understandable. However, my answer wasn't meant for optimizing this specific piece of code, which the (rightfully) accepted answer has covered. These are just good practices which I highly recommend you follow when using Cython in the future. And since I can't comment on the other answer, I also suggest to use `np.linalg.solve(A,b)` instead of `dot(inv(A),b)` as this will get the result doing much less work than finding the inverse and multiplying. Happy Cythoning ! — NILobachevsky, Sep 29 '16 at 13:08

Speed-up cython code

2 Answers2