My CUDA application performs an associative reduction over a volume. Essentially, each thread computes values that are atomically added to overlapping locations in the same output buffer in global memory.
Is it possible to launch this kernel concurrently with different input parameters but the same output buffer? In other words, the concurrent kernels would all share one global buffer and write to it atomically.
All kernels are running on the same GPU.
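
For concreteness, here is a minimal sketch of the setup I have in mind; the kernel body, buffer sizes, and parameter values are illustrative only, not my actual code:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: many threads (and, potentially, several concurrent
// launches) add into overlapping locations of the same output buffer,
// so the accumulation is done with atomicAdd.
__global__ void reduceVolume(const float* in, float* out, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&out[i % 256], in[i] * scale);
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, 256 * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    cudaMemset(d_out, 0, 256 * sizeof(float));

    // Two streams so the launches can overlap; both kernels write
    // atomically into the same d_out buffer with different parameters.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(256), grid((n + block.x - 1) / block.x);
    reduceVolume<<<grid, block, 0, s0>>>(d_in, d_out, n, 1.0f);
    reduceVolume<<<grid, block, 0, s1>>>(d_in, d_out, n, 2.0f);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The question is whether this pattern (concurrent launches on separate streams, all accumulating atomically into one shared global buffer on a single GPU) is safe and well-defined.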