Could you please help me understand how to write CUDA kernels in Python? AFAIK, numba.vectorize can run on the GPU (cuda), on a single CPU core (cpu), or on multiple CPU cores (parallel), depending on the target argument. But target='cuda' requires setting up CUDA kernels.
The main issue is that many examples and answers on the Internet refer to the deprecated NumbaPro library, so the outdated wikis are hard to follow, especially if you're a newbie.
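To make the target switch concrete, here is a minimal sketch of how I understand it, not a definitive recipe. The helper name `make_vector_add` is my own (hypothetical), and the NumPy fallback is only there so the snippet still runs on a machine without Numba. Passing "cuda" instead of "cpu" additionally needs a supported GPU and a working driver.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    vectorize = None  # Numba not installed; fall back to plain NumPy below


def make_vector_add(target="cpu"):
    """Build an element-wise float32 add for the given Numba target.

    make_vector_add is a hypothetical helper name used for illustration.
    """
    if vectorize is None:
        # Assumption: np.add is an acceptable stand-in when Numba is missing.
        return np.add

    # The same scalar function body is compiled for whichever target is named.
    @vectorize(["float32(float32, float32)"], target=target)
    def vector_add(a, b):
        return a + b

    return vector_add


a = np.ones(8, dtype=np.float32)
b = np.ones(8, dtype=np.float32)

add_cpu = make_vector_add("cpu")  # "parallel" or "cuda" are the other targets
result = add_cpu(a, b)
print(result)
```

The scalar function stays the same for every target; only the string passed to `target` changes.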
I have:
- latest Anaconda (v2)
- latest Numba (v0.25)
- latest CUDA toolkit (v7)
Here is the error I'm getting:
numba.cuda.cudadrv.driver.CudaAPIError: 1 Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
Here is the code:

import numpy as np
import time
from numba import vectorize, cuda


# Element-wise float32 addition compiled for the CUDA target
@vectorize(['float32(float32, float32)'], target='cuda')
def VectorAdd(a, b):
    return a + b


def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    # Time the ufunc call, including host/device transfers
    start = time.time()
    C = VectorAdd(A, B)
    vector_add_time = time.time() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("VectorAdd took %s seconds" % vector_add_time)


if __name__ == '__main__':
    main()
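
One experiment that might help narrow this down is to call the function on smaller slices and see whether the error depends on the array size. This is only a sketch under that assumption; I have not confirmed that size is the cause. `apply_in_chunks` and `chunk_size` are hypothetical names, and `np.add` stands in for VectorAdd so the snippet runs without a GPU.

```python
import numpy as np


def apply_in_chunks(func, a, b, chunk_size=1000000):
    """Apply an element-wise binary function to a and b slice by slice.

    apply_in_chunks and chunk_size are hypothetical names for illustration.
    """
    out = np.empty_like(a)
    for start in range(0, a.size, chunk_size):
        stop = min(start + chunk_size, a.size)
        # Each call sees at most chunk_size elements
        out[start:stop] = func(a[start:stop], b[start:stop])
    return out


a = np.ones(2500, dtype=np.float32)
b = np.ones(2500, dtype=np.float32)

# np.add is a stand-in here; VectorAdd would be passed instead on a GPU machine
chunked = apply_in_chunks(np.add, a, b, chunk_size=1000)
print(chunked[:5])
```

If the chunked call succeeds where the full-size call fails, that would suggest the problem is tied to how many elements are processed in a single launch.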