I am refactoring a piece of code for a project. The aim is to reduce the runtime as much as possible, so I am using PyCUDA to run a big loop on the GPU.
The kernel needs to follow this basic logic:
for pixel in pixels:   (pixels is the input array; currently I have one thread per pixel)
    for electron in pixel:
        e_array = all zeros
        add something to e_array if a condition is satisfied
        add something to e_array if another condition is satisfied
        add something to e_array if another condition is satisfied
        multiply e_array by a constant and round all the values down to the nearest integer
        add e_array to the new_image array
return the new_image array
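To make that concrete, here is a rough sketch of the kernel shape I have in mind. The names, the three conditions and the fixed E_SIZE are placeholders for this post only; in my real code the size of e_array is exactly the thing I don't know at compile time.

    // Sketch only: E_SIZE, SCALE, condition_a/b/c and process_pixels are
    // placeholders invented for illustration.
    #define E_SIZE 8                  // placeholder; not actually known at compile time
    #define SCALE  0.5f               // placeholder for the constant

    __device__ bool condition_a(float px, int e) { return px > 0.0f; }    // placeholder
    __device__ bool condition_b(float px, int e) { return (e % 2) == 0; } // placeholder
    __device__ bool condition_c(float px, int e) { return px > 1.0f; }    // placeholder

    __global__ void process_pixels(const float *pixels,
                                   const int *electrons_per_pixel,
                                   float *new_image,
                                   int n_pixels)
    {
        int pix = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel
        if (pix >= n_pixels) return;

        for (int e = 0; e < electrons_per_pixel[pix]; ++e) {
            float e_array[E_SIZE] = {0.0f};                // "all zeros"

            if (condition_a(pixels[pix], e)) e_array[0] += 1.0f;  // "add something if ..."
            if (condition_b(pixels[pix], e)) e_array[1] += 1.0f;
            if (condition_c(pixels[pix], e)) e_array[2] += 1.0f;

            for (int k = 0; k < E_SIZE; ++k) {
                // multiply by the constant and round down to the nearest integer
                new_image[pix * E_SIZE + k] += floorf(e_array[k] * SCALE);
            }
        }
    }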
The issue I'm encountering is that the size of e_array is not known at compile time, so I can't just declare it in the inner loop. What would be the best way of keeping a separate e_array for each electron (without knowing the number of electrons per pixel), and then adding it to a shared memory array after rounding down?
(The reason I can't set a fixed maximum number of electrons is that it could be 10 or it could be 100 per pixel - there is a random element, so the maximum per pixel wouldn't be known until the end of the simulation. I suppose the theoretical maximum per pixel is the total number of electrons across all pixels, but that could be millions and would hurt performance.)
Currently I allocate memory for the e_array on the host and pass it as an argument to the kernel, but this means the memory is shared across all electrons.
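In other words, the kernel just receives one big device buffer (allocated from the host through PyCUDA) as a pointer argument, and nothing gets its own storage unless I offset into that buffer by hand. A simplified sketch of that setup, again with placeholder names:

    // Sketch only: zero_e_array, e_array_global and e_size are placeholders.
    __global__ void zero_e_array(float *e_array_global,   // one buffer for everything
                                 int n_pixels, int e_size)
    {
        int pix = blockIdx.x * blockDim.x + threadIdx.x;
        if (pix >= n_pixels) return;

        // Each thread only gets its own region because I offset into the
        // shared buffer manually; the allocation itself is common to all
        // threads and all electrons.
        float *e_array = e_array_global + (size_t)pix * e_size;
        for (int k = 0; k < e_size; ++k)
            e_array[k] = 0.0f;
    }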
I have tried using dynamic memory allocation (CUDA's in-kernel malloc) within the loop and then freeing the memory after every iteration, but when I run it I get an illegal memory access error. I don't think this is allowed within a loop - does anyone know if it is possible?
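For reference, this is roughly the shape of what I tried (placeholder names again). From what I've read, in-kernel malloc draws from a separate device heap whose size is set from the host (cudaDeviceSetLimit with cudaLimitMallocHeapSize) and it returns NULL when that heap runs out, so I've added a NULL check here for illustration, but I still don't know whether calling it inside the loop like this is legitimate:

    // Sketch only: process_pixels_malloc, e_size and scale are placeholders.
    #include <cstdio>

    __global__ void process_pixels_malloc(const int *electrons_per_pixel,
                                          float *new_image,
                                          int n_pixels, int e_size, float scale)
    {
        int pix = blockIdx.x * blockDim.x + threadIdx.x;
        if (pix >= n_pixels) return;

        for (int e = 0; e < electrons_per_pixel[pix]; ++e) {
            // In-kernel malloc comes out of the device heap; if the heap is
            // exhausted it returns NULL, and using the pointer anyway would
            // itself be an illegal memory access.
            float *e_array = (float *)malloc(e_size * sizeof(float));
            if (e_array == NULL) {
                printf("device malloc failed for pixel %d\n", pix);
                return;
            }

            for (int k = 0; k < e_size; ++k)
                e_array[k] = 0.0f;

            // ... the conditional additions to e_array would go here ...

            for (int k = 0; k < e_size; ++k)
                new_image[pix * e_size + k] += floorf(e_array[k] * scale);

            free(e_array);   // freed again before the next electron
        }
    }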