
Could you please help me understand how to write CUDA kernels in Python? As far as I know, numba.vectorize can target 'cuda', 'cpu', or 'parallel' (multi-CPU) depending on the target argument, but my understanding is that target='cuda' requires setting up a CUDA kernel.
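From what I have read so far, a raw kernel would be written with @cuda.jit instead of @vectorize and launched with an explicit grid/block configuration. Here is my rough, untested understanding of what that looks like (the names and sizes are just placeholders):

import numpy as np
from numba import cuda

@cuda.jit
def vector_add_kernel(a, b, out):
    # One GPU thread per element; cuda.grid(1) is this thread's global index.
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = a[i] + b[i]

# Pick a block size and enough blocks to cover the whole array.
n = 1024
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.zeros_like(a)
threads_per_block = 128
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
vector_add_kernel[blocks_per_grid, threads_per_block](a, b, out)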

The main issue is that many of the examples and answers on the Internet relate to the deprecated NumbaPro library, so it is hard to follow such outdated wikis, especially if you are a newbie.

I have:

  • latest Anaconda (v2)
  • latest Numba (v0.25)
  • latest CUDA toolkit (v7)

Here is the error I'm getting:

numba.cuda.cudadrv.driver.CudaAPIError: 1 Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

import numpy as np
import time

from numba import vectorize, cuda

@vectorize(['float32(float32, float32)'], target='cuda')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    start = time.time()
    C = VectorAdd(A, B)
    vector_add_time = time.time() - start

    print "C[:5] = " + str(C[:5])
    print "C[-5:] = " + str(C[-5:])

    print "VectorAdd took for % seconds" % vector_add_time

if __name__ == '__main__':
    main()
Novitoll
  • There is nothing wrong with the code you posted; I can run it without errors. I can think of two possibilities: (a) your Numba installation is totally broken, or (b) your GPU has very little memory. You are allocating three 128 MB vectors on the device, and if the GPU doesn't have much memory you might be running out. Try reducing N to something much smaller and see what happens – talonmies Apr 08 '16 at 15:00
  • @talonmies, wow, it worked with N = 10 million and fails with 20 million. Could you please tell me how you calculated 3 × 128 MB? I have a GeForce 820M; its memory is 2 GB, I believe – Novitoll Apr 08 '16 at 16:21
  • 32,000,000 elements × 4 bytes = 128 MB per array. You might also be hitting the watchdog timer limit for your GPU if it drives a display and doesn't have much compute capacity – talonmies Apr 08 '16 at 16:46
  • @talonmies, thanks, got it. Actually, this code runs slower on the GPU than with target="cpu": VectorAdd on CPU took 0.0160000324249 seconds, while VectorAdd on GPU took 0.695999860764 seconds. But I think something is installed wrong, because nvprof says "No kernels were profiled."; it seems I don't use my GPU at all. (A sketch of timing the GPU compute without the host/device copies follows these comments.) – Novitoll Apr 08 '16 at 16:55
  • The nvprof issue doesn't mean your GPU isn't being used; it is. For the profiler to work, a particular CUDA API call must be made before your program exits. It probably means that the internal Numbapro runtime doesn't call that API on exit, so the profiler can't grab statistics – talonmies Apr 09 '16 at 09:09
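A likely reason the GPU timing in the comments above is slower than the CPU is that it includes copying the two ~128 MB input arrays to the device and the result back to the host. Here is a rough sketch of separating the transfers from the compute, assuming the VectorAdd ufunc and the A and B arrays from the question above; cuda.to_device, copy_to_host and cuda.synchronize are standard parts of Numba's CUDA API:

import time
from numba import cuda

# Copy the inputs to the GPU once, outside the timed region.
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)

start = time.time()
d_C = VectorAdd(d_A, d_B)   # device arrays in, device array out
cuda.synchronize()          # make sure the kernel has actually finished
gpu_compute_time = time.time() - start

C = d_C.copy_to_host()      # bring the result back to the host separately
print "GPU compute only: %f seconds" % gpu_compute_time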

1 Answer


The code, as posted, is correct and will run on a Python 2 Numbapro/Accelerate system without error.

The particular system being used to run the code most likely had a GPU with limited capacity, and with 32-million-element vectors it was hitting either a display driver watchdog timeout or an out-of-memory error. Reducing the size of the input data allowed the code to run correctly.
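To check the memory theory on a given card, one option (a sketch, assuming a Numba version in which cuda.current_context().get_memory_info() is available) is to compare the free device memory against what three float32 vectors of length N require:

import numpy as np
from numba import cuda

N = 32000000
needed = 3 * N * np.dtype(np.float32).itemsize   # A, B and the result C

free, total = cuda.current_context().get_memory_info()
print "GPU memory: %d MB free of %d MB total" % (free // 2**20, total // 2**20)
print "These three vectors need about %d MB" % (needed // 2**20)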

[This answer was assembled from comments and added as a community wiki entry to get this question off the unanswered list.]

talonmies