
As far as I know, atomicAdd can be used on shared memory and global memory. I need to atomically add floating-point numbers from threads in different blocks; hence, I need a global temporary to hold the sum.

Is there a way to allocate temporary globals from inside a kernel?

Currently, I allocate a temporary global and pass a pointer to my kernel. This doesn't appear to be very user-friendly.
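To illustrate, here is a minimal sketch of the current approach (names like `sum_kernel` and `d_sum` are mine, not from any real codebase):

```cuda
// Kernel receives a pointer to a host-allocated global temporary.
__global__ void sum_kernel(const float *in, int n, float *global_sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(global_sum, in[i]); // global memory, so it works across blocks
}

// Host side:
// float *d_sum;
// cudaMalloc(&d_sum, sizeof(float));
// cudaMemset(d_sum, 0, sizeof(float));
// sum_kernel<<<grid, block>>>(d_in, n, d_sum);
```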

TL;DR: I need a temporary variable for atomic addition across different blocks, without having to explicitly allocate a global and pass a pointer to it into the kernel.

Yashas
    I don't know why you say it isn't user-friendly. It might be possible to allocate from within the kernel code, but you would have to have one thread do it and then pass the address to all other threads and blocks that needed it. Doing that in a safe fashion would require some kind of synchronization in kernel code. To do that you would either need cuda cooperative groups or .... atomics. So you'd be chasing your tail and bringing a lot of baggage along - for what? You can declare statically a `__device__` variable that is usable by all threads without any host initialization. – Robert Crovella May 08 '19 at 19:39
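The statically declared `__device__` variable suggested in the comment could look something like this (a sketch; the names are illustrative):

```cuda
// A __device__ variable is visible to all threads of all blocks,
// with no host-side allocation needed.
__device__ float g_sum;

__global__ void sum_kernel(const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&g_sum, in[i]);
}

// The host can reset and read it by symbol, without holding a pointer:
// float zero = 0.f;
// cudaMemcpyToSymbol(g_sum, &zero, sizeof(float));
// sum_kernel<<<grid, block>>>(d_in, n);
// float result;
// cudaMemcpyFromSymbol(&result, g_sum, sizeof(float));
```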

1 Answer


You can use malloc() inside kernel code. However, it's rarely a good idea to do so. It's usually much better to pre-allocate scratch space before the kernel is launched, pass it as an argument, and let each thread, or group of threads, have some formula for determining the location they will use for their common atomics within that scratch area.
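As a sketch of the pre-allocated-scratch idea (the "formula" here is a simple one-slot-per-block layout; the names are mine):

```cuda
// Host pre-allocates gridDim.x floats of scratch and zeroes them.
// Each block's threads share one slot, determined by a simple formula.
__global__ void partial_sums(const float *in, int n, float *scratch)
{
    float *block_slot = &scratch[blockIdx.x]; // formula: one slot per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(block_slot, in[i]); // contention stays within one block
}
```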

Now, you've written this isn't very "user-friendly"; I guess you mean developer-friendly. Well, it can be made more friendly! For example, my CUDA Modern C++ API wrappers library offers an equivalent of std::unique_ptr - but for device memory:

#include <cuda/api_wrappers.hpp>

//... etc. etc. ...

{
    auto scratch = cuda::memory::device::make_unique<float[]>(1024, my_cuda_device_id);
    my_kernel<<<blah,blah,blah>>>(output, input, scratch.get());
} // the device memory is released here!

(This is for synchronous launches, of course.)

Something else you can do to be more developer-friendly is use some kind of proxy function that returns the location in that scratch memory relevant to a specific thread / warp / group of threads sharing an atomic target. That should at least hide away some of the repetitive, annoying address arithmetic your kernel might be using.
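Such a proxy function might be sketched as follows (a hypothetical helper; `my_atomic_slot` and the one-slot-per-block policy are illustrative):

```cuda
// Hypothetical proxy hiding the scratch-address arithmetic.
__device__ float *my_atomic_slot(float *scratch)
{
    return &scratch[blockIdx.x]; // e.g., one slot shared per block
}

__global__ void my_kernel(float *out, const float *in, int n, float *scratch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(my_atomic_slot(scratch), in[i]);
}
```

If the addressing policy changes later, only the proxy needs updating, not every atomic in the kernel.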


There's also the option of using global __device__ variables (as @RobertCrovella mentioned), but I wouldn't encourage that: the size would have to be fixed at compile time, and you wouldn't be able to use it from two kernels at once without it being painful, etc.

einpoklum