I'm writing code for real-time processing of images from a camera. I am using Python 3.5 with the Anaconda Accelerate/Numba packages to perform most of the calculations on the GPU. I am having trouble implementing a function that finds the position of the largest element in a float32 2D array; the array is already in GPU memory. The problem is that it is terribly slow: it is the bottleneck of my whole code. The code:
@n_cuda.jit('void(float32[:,:], float32, float32, float32)')
def d_findcarpeak(temp_mat, height, width, peak_flat):
    row, col = cuda.grid(2)
    if row < height and col < width:
        peak_flat = temp_mat.argmax()
Here is where I call it:
d_findcarpeak[number_of_blocks, threads_per_block](
    d_temp_mat, height, width, d_peak_flat)
How can I rewrite this code?
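As far as I can tell, argmax() is not supported inside CUDA device code, and assigning to peak_flat inside the kernel only rebinds a local name, since scalars are passed by value, so I suspect I need a different approach altogether. One direction I have been considering is a two-pass reduction with atomics: one kernel folds every element into a single maximum value, and a second kernel finds the flat index of that value. Below is a minimal sketch; the kernel and buffer names (d_max_value, d_peak_index, d_max_val) are mine, and it assumes a Numba version where cuda.atomic.max supports float32 arrays:

import numpy as np
from numba import cuda

@cuda.jit
def d_max_value(temp_mat, height, width, max_val):
    row, col = cuda.grid(2)
    if row < height and col < width:
        # Atomically fold every element into one running maximum.
        cuda.atomic.max(max_val, 0, temp_mat[row, col])

@cuda.jit
def d_peak_index(temp_mat, height, width, max_val, peak_flat):
    row, col = cuda.grid(2)
    if row < height and col < width:
        if temp_mat[row, col] == max_val[0]:
            # Ties are possible; keep the smallest flat index.
            cuda.atomic.min(peak_flat, 0, row * width + col)

I would then call it like this, initializing the buffers so the atomics have something to fold into (-inf for the maximum, an out-of-range index for the position):

d_max_val = cuda.to_device(np.array([-np.inf], dtype=np.float32))
d_peak_flat = cuda.to_device(np.array([height * width], dtype=np.int64))
d_max_value[number_of_blocks, threads_per_block](
    d_temp_mat, height, width, d_max_val)
d_peak_index[number_of_blocks, threads_per_block](
    d_temp_mat, height, width, d_max_val, d_peak_flat)
peak_flat = d_peak_flat.copy_to_host()[0]

Is this two-pass approach a reasonable direction, or is there a faster built-in way to do an argmax reduction on the device?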