You can use malloc()
inside kernel code. However, it's rarely a good idea to do so. It's usually much better to pre-allocate scratch space before the kernel is launched, pass it as an argument, and let each thread, or group of threads, have some formula for determining the location they will use for their common atomics within that scratch area.
Now, you've written this isn't very "user-friendly"; I guess you mean developer-friendly. Well, it can be made more friendly! For example, my CUDA Modern C++ API wrappers library offers an equivalent of std::unique_ptr
- but for device memory:
#include <cuda/api_wrappers.hpp>
//... etc. etc. ...
{
auto scratch = cuda::memory::device::make_unique<float[]>(1024, my_cude_device_id);
my_kernel<<<blah,blah,blah>>>(output, input, scratch.get();
} // the device memory is released here!
(this is for synchronous launches of course.)
Something else you can do be more developer-friendly is use some kind of proxy function to get the location in that scratch memory relevant to a specific thread / warp / group of threads / whatever, which uses the same address for atomics. That should at least hide away some of the repeating, annoying, address arithmetic your kernel might be using.
There's also the option of using global __device__
variables (like @RobertCrovella mentioned), but I wouldn't encourage that: The size would have to be fixed at compile time, and you wouldn't be able to use if from two kernels at once without it being painful, etc.