I have a piece of code that uses Numbapro to write a simple kernel to square the contents of two arrays of size 41724,add them together and store it into another array. All the arrays have the same size and are float32. The code is below:
import numpy as np
from numba import *
from numbapro import cuda
@cuda.jit('void(float32[:],float32[:],float32[:])')
def square_add(a,b,c):
tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x
i = tx + bx * bw
#Since the length of a is 41724 and the total
#threads is 41*1024 = 41984, this check is necessary
if (i>len(a)):
return
else:
c[i] = a[i]*a[i] + b[i]*b[i]
a = np.array(range(0,41724),dtype = np.float32)
b = np.array(range(41724,83448),dtype=np.float32)
c = np.zeros(shape=(1,41724),dtype=np.float32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c,copy=False)
#Launch the kernel; Gridsize = (1,41),Blocksize=(1,1024)
square_add[(1,41),(1,1024)](d_a,d_b,d_c)
c = d_c.copy_to_host()
print c
print len(c[0])
The values I am getting when I print the result of the operation (the array c) is completely different compared to that when I do the exact same thing in a python terminal. I do not know what I am doing wrong here.