
I tried to speed up my Python code by using CuPy instead of NumPy. The problem is that with CuPy my code got drastically slower. Maybe I approached the problem a little too naively.

Maybe someone can spot a bottleneck in my code:

import cupy as np
import time as ti

def f(y, t):
    y_ = np.zeros(2 * N_1*N_2) # n: e-6, c: e-5
    for i in range(0, N_1*N_2):
        y_[i] = y[i + N_1*N_2] # n: e-7, c: e-5 or e-6
    for i in range(N_1*N_2):
        sum = -4*y[i] # n: e-7, c: e-7 after some statements e-5
        if (i + 1 in indexes) and (not (i in indi)):
            sum += y[i+1] # n: e-7, c: e-7 after some statements e-5
        if (i - 1) in indexes and (i % N_1 != 0):
            sum += y[i-1] # n: e-7, c: e-7 after some statements e-5
        if i + N_1 in indexes:
            sum += y[i+N_1] # n: e-7, c: e-7 after some statements e-5
        if i - N_1 in indexes:
            sum += y[i-N_1] # n: e-7, c: e-7 after some statements e-5
        y_[i + N_1*N_2] = sum

    return y_

def k_1(y, t, h):
    return np.asarray(f(y, t)) * h

def k_2(y, t, h):
    return np.asarray(f(np.add(np.asarray(y) , np.multiply(1/2 , k_1(y, t, h))), t + 1/2 * h)) * h

# k_3, k_4 look just like k_2, maybe with a 1/2 here or there

# some init stuff is happening here

while t < T_end:
    # also some magic happening here which is just data saving
    y = np.asarray(y) + 1/6*(k_1(y, t, m) + 2*k_2(y, t, m) + 2*k_3(y, t, m) + k_4(y, t, m))
    t += m

EDIT I tried to benchmark my code; the results can be seen as comments in the code above. Each number refers to its line, the units are seconds, n: NumPy, c: CuPy, and I mostly give a rough estimate of the order of magnitude. Additionally I tested

np.multiply # n: e-6, c: e-5

and

np.add # n: e-5 or e-6, c: 0.005 or e-5
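
(A note on the measurements: CuPy launches GPU kernels asynchronously, so a plain wall-clock measurement around a single statement can mostly capture launch overhead or wait on earlier kernels. A hedged sketch of a more reliable comparison, with explicit device synchronization, could look like the following; it is an illustration, not the exact script used for the numbers above.)

import time
import numpy
import cupy

a_cpu = numpy.random.rand(1000, 1000)
a_gpu = cupy.asarray(a_cpu)          # copy the data to the GPU once

t0 = time.time()
numpy.add(a_cpu, a_cpu)
print('numpy add:', time.time() - t0)

cupy.cuda.Device().synchronize()     # make sure earlier GPU work is finished
t0 = time.time()
cupy.add(a_gpu, a_gpu)
cupy.cuda.Device().synchronize()     # wait for the kernel before stopping the clock
print('cupy add:', time.time() - t0)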

2 Answers


Your code is not slow because numpy is slow but because you call many (Python) functions, and calling functions (and iterating, accessing objects, and basically everything) is slow in Python. Thus cupy will not help you (and will probably hurt performance, because it has to do more setup, e.g. copying data over to the GPU). If you can reformulate your algorithm to use fewer Python function calls (vectorizing, as in the other answer), this will speed up your code tremendously, and you probably do not need cupy at all.
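
To make the point concrete, here is a minimal, self-contained sketch (not the question's exact stencil, since indexes and indi are not defined in the post) comparing a per-element Python loop with the equivalent whole-array operations:

import numpy as np

n = 100_000
y = np.random.rand(2 * n)

# Python-level loop: one iteration, bounds check and item access per element
out_loop = np.zeros(n)
for i in range(1, n - 1):
    out_loop[i] = -4 * y[i] + y[i - 1] + y[i + 1]

# Vectorized: three whole-array operations, no per-element Python overhead
out_vec = np.zeros(n)
out_vec[1:-1] = -4 * y[1:n-1] + y[0:n-2] + y[2:n]

assert np.allclose(out_loop, out_vec)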

You could also look into numba, which compiles your code with LLVM to native code. If you do so, be sure to read some documentation and use nopython=True, otherwise you will only swap slow cupy code for slow numba code.
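
As a rough sketch of what that could look like (the neighbour conditions are omitted because they depend on the undefined indexes and indi, f_numba is just an illustrative name, and this assumes numba is installed):

import numpy as np
from numba import jit

@jit(nopython=True)   # nopython mode: fails loudly instead of falling back to slow object mode
def f_numba(y, n):
    y_ = np.zeros(2 * n)
    for i in range(n):
        y_[i] = y[i + n]          # plain loops are fine once compiled
    for i in range(n):
        y_[i + n] = -4.0 * y[i]   # neighbour terms left out in this sketch
    return y_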

syntonym
  • I will take a look at your suggestion too. Do you know if a return statement makes the array stay on the GPU? – Bomel Jul 04 '18 at 19:53
  • I've not seen cupy before your question, but as far as I understand it the data will stay on the GPU and Python only holds a reference to it. You could also look into writing [your own kernel code for cupy](https://docs-cupy.chainer.org/en/stable/tutorial/kernel.html). – syntonym Jul 04 '18 at 21:15
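
(For reference, the kernel tutorial linked in the comment above defines element-wise kernels roughly along these lines; the example below is adapted from that documentation and is only a minimal illustration.)

import cupy as cp

squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',    # input arguments
    'float32 z',               # output argument
    'z = (x - y) * (x - y)',   # per-element operation, compiled into a single CUDA kernel
    'squared_diff')

x = cp.arange(10, dtype=cp.float32).reshape(2, 5)
y = cp.arange(5, dtype=cp.float32)
print(squared_diff(x, y))      # result stays on the GPU until copied back with cp.asnumpy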

Your code example doesn't work since you haven't defined N_1, N_2, indexes and indi anywhere, and the comments in the code don't really help others understand what's going on. Your code probably won't benefit from numba/cupy as it stands, because you haven't vectorized the operations; plain lists would probably be just as fast as numpy arrays the way your code works at the moment.

If you get rid of your for loops and change

y_ = np.zeros(2 * N_1*N_2)
for i in range(0, N_1*N_2):
    y_[i] = y[i + N_1*N_2] 

to

n = N_1*N_2
y_ = np.zeros(2*n)
y_[:n] = y[n:2*n]

and so forth, you will speed your code up substantially.
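
For example, if the indexes/indi membership tests in f really just implement standard 5-point-Laplacian boundary handling on an N_1 x N_2 grid (that is an assumption, since their definitions are not shown), the whole function could be vectorized along these lines:

def f_vec(y, t):
    # hedged sketch: assumes the neighbour conditions encode ordinary grid boundaries
    n = N_1 * N_2
    u = y[:n].reshape(N_2, N_1)    # first half of the state
    v = y[n:].reshape(N_2, N_1)    # second half of the state
    lap = -4.0 * u
    lap[:, :-1] += u[:, 1:]        # right neighbour (i + 1)
    lap[:, 1:]  += u[:, :-1]       # left neighbour  (i - 1)
    lap[:-1, :] += u[1:, :]        # neighbour below (i + N_1)
    lap[1:, :]  += u[:-1, :]       # neighbour above (i - N_1)
    return np.concatenate([v.ravel(), lap.ravel()])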

user2653663
  • I commented the definitions of N_1, N_2, indexes and indi out because Stack Overflow wouldn't let me post that much code. I will try to implement the vectorization. – Bomel Jul 04 '18 at 19:50