I am refactoring a piece of code for a project. The aim is to reduce the runtime as much as possible, so I am using PyCUDA to run a big loop on the GPU.
The kernel needs to follow this basic logic:
for pixel in pixels:   (pixels is the input array; currently I have one thread per pixel)
    for electron in pixel:
        e_array = all zeros
        add something to e_array if a condition is satisfied
        add something to e_array if another condition is satisfied
        add something to e_array if another condition is satisfied
        multiply e_array by a constant and round all the values down to the nearest integer
        add e_array to the new_image array
return the new_image array
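To make that concrete, here is a rough sketch of the kernel shape I have in mind. The names, the three conditions and the fixed E_SIZE are placeholders for this post only; in my real code the size of e_array is exactly the thing I don't know at compile time.

    // Sketch only: E_SIZE, SCALE, condition_a/b/c and process_pixels are
    // placeholders invented for illustration.
    #define E_SIZE 8                  // placeholder; not actually known at compile time
    #define SCALE  0.5f               // placeholder for the constant

    __device__ bool condition_a(float px, int e) { return px > 0.0f; }    // placeholder
    __device__ bool condition_b(float px, int e) { return (e % 2) == 0; } // placeholder
    __device__ bool condition_c(float px, int e) { return px > 1.0f; }    // placeholder

    __global__ void process_pixels(const float *pixels,
                                   const int *electrons_per_pixel,
                                   float *new_image,
                                   int n_pixels)
    {
        int pix = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel
        if (pix >= n_pixels) return;

        for (int e = 0; e < electrons_per_pixel[pix]; ++e) {
            float e_array[E_SIZE] = {0.0f};                // "all zeros"

            if (condition_a(pixels[pix], e)) e_array[0] += 1.0f;  // "add something if ..."
            if (condition_b(pixels[pix], e)) e_array[1] += 1.0f;
            if (condition_c(pixels[pix], e)) e_array[2] += 1.0f;

            for (int k = 0; k < E_SIZE; ++k) {
                // multiply by the constant and round down to the nearest integer
                new_image[pix * E_SIZE + k] += floorf(e_array[k] * SCALE);
            }
        }
    }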
The issue I'm encountering is that the size of e_array is not known at compile time, so I can't just declare it in the inner loop. What would be the best way of keeping a separate e_array for each electron (without knowing the number of electrons per pixel), and then adding it to a shared memory array after rounding down?
(The reason I can't set a fixed maximum number of electrons is that it could be 10 or it could be 100 per pixel - there is a random element, so the maximum per pixel wouldn't be known until the end of the simulation. I suppose the theoretical maximum per pixel is the total number of electrons across all pixels, but that could be millions and would hurt performance.)
Currently I allocate memory for the e_array on the host and pass it as an argument to the kernel, but this means the memory is shared across all electrons.
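In other words, the kernel just receives one big device buffer (allocated from the host through PyCUDA) as a pointer argument, and nothing gets its own storage unless I offset into that buffer by hand. A simplified sketch of that setup, again with placeholder names:

    // Sketch only: zero_e_array, e_array_global and e_size are placeholders.
    __global__ void zero_e_array(float *e_array_global,   // one buffer for everything
                                 int n_pixels, int e_size)
    {
        int pix = blockIdx.x * blockDim.x + threadIdx.x;
        if (pix >= n_pixels) return;

        // Each thread only gets its own region because I offset into the
        // shared buffer manually; the allocation itself is common to all
        // threads and all electrons.
        float *e_array = e_array_global + (size_t)pix * e_size;
        for (int k = 0; k < e_size; ++k)
            e_array[k] = 0.0f;
    }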
I have tried using dynamic memory allocation (CUDA's in-kernel malloc) within the loop and then freeing the memory after every iteration, but when I run it I get an illegal memory access error. I don't think this is allowed within a loop - does anyone know if it is possible?
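For reference, this is roughly the shape of what I tried (placeholder names again). From what I've read, in-kernel malloc draws from a separate device heap whose size is set from the host (cudaDeviceSetLimit with cudaLimitMallocHeapSize) and it returns NULL when that heap runs out, so I've added a NULL check here for illustration, but I still don't know whether calling it inside the loop like this is legitimate:

    // Sketch only: process_pixels_malloc, e_size and scale are placeholders.
    #include <cstdio>

    __global__ void process_pixels_malloc(const int *electrons_per_pixel,
                                          float *new_image,
                                          int n_pixels, int e_size, float scale)
    {
        int pix = blockIdx.x * blockDim.x + threadIdx.x;
        if (pix >= n_pixels) return;

        for (int e = 0; e < electrons_per_pixel[pix]; ++e) {
            // In-kernel malloc comes out of the device heap; if the heap is
            // exhausted it returns NULL, and using the pointer anyway would
            // itself be an illegal memory access.
            float *e_array = (float *)malloc(e_size * sizeof(float));
            if (e_array == NULL) {
                printf("device malloc failed for pixel %d\n", pix);
                return;
            }

            for (int k = 0; k < e_size; ++k)
                e_array[k] = 0.0f;

            // ... the conditional additions to e_array would go here ...

            for (int k = 0; k < e_size; ++k)
                new_image[pix * e_size + k] += floorf(e_array[k] * scale);

            free(e_array);   // freed again before the next electron
        }
    }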