Is it possible to synchronize only a subset of the warps in a CUDA block, i.e. with an effect somewhere between __syncwarp() and __syncthreads()?

The scenario is that all threads in the block execute a program like the following:

copy data from global memory to shared buffer
__syncthreads()
work in shared buffer
__syncthreads()
work in shared buffer
...

But after the first __syncthreads(), the data that was copied by warps 3-5 (for example) is not yet needed.

Is there a way to tell the compiler that there is no data dependency, so that the first __syncthreads() does not wait for these copies to finish and the latency of the global memory access can be hidden?

tmlen
  • @talonmies This shouldn't be marked as a dupe. Pointing to barrier.sync is correct but misleading. One can use Cooperative Groups [1] to pre-partition the thread block so that each thread group can be synchronized independently of the others. Using CG is a better way of solving this problem, as it abstracts the implementation out to an Nvidia library that can deal with the problem in an architecture-optimal way. A hand-coded barrier.sync solution is likely to age out quickly. [1] https://devblogs.nvidia.com/cooperative-groups/ – llukas May 12 '19 at 23:02
  • But with cooperative groups it is not possible to make partitions larger than 32 threads, so it can't synchronize multiple warps. – tmlen May 13 '19 at 08:27
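
For illustration, here is a minimal sketch of the hand-coded named-barrier approach mentioned in the comments, where only warps 0-2 of a 256-thread block synchronize with each other. It uses the PTX instruction `bar.sync` with a barrier ID and a thread count through inline assembly. The names `sync_warp_subset`, `subset_sync_kernel` and `run_subset_sync_demo`, the block size, and the data layout are all hypothetical, and the sketch assumes that barrier 0 is the one used by __syncthreads(), so it uses barrier 1 instead. It only shows the partial barrier itself; it does not by itself hide the latency of copies made by the other warps.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockThreads  = 256;     // 8 warps per block (assumed for the demo)
constexpr int kSubsetThreads = 3 * 32;  // warps 0-2 form the subset

// Hypothetical helper: waits until `num_threads` threads of this block have
// arrived at named barrier `barrier_id`. `num_threads` must be a multiple of
// the warp size. Assumption: ID 0 is left alone because __syncthreads() uses it.
__device__ __forceinline__ void sync_warp_subset(unsigned barrier_id,
                                                 unsigned num_threads)
{
    asm volatile("bar.sync %0, %1;" :: "r"(barrier_id), "r"(num_threads)
                 : "memory");
}

__global__ void subset_sync_kernel(const int* in, int* out)
{
    __shared__ int buf[kBlockThreads];
    const int t = threadIdx.x;

    buf[t] = in[t];  // every thread copies one element into shared memory

    if (t < kSubsetThreads) {
        // Only warps 0-2 wait for each other here; warps 3-7 are not involved.
        sync_warp_subset(1, kSubsetThreads);
        // Read an element written by another thread of the same subset.
        out[t] = buf[kSubsetThreads - 1 - t];
    }

    __syncthreads();  // full block barrier before reading the other warps' data

    if (t >= kSubsetThreads) {
        out[t] = buf[kBlockThreads - 1 - (t - kSubsetThreads)];
    }
}

// Host-side check: returns the number of mismatching elements,
// or -1 if a CUDA call failed.
int run_subset_sync_demo()
{
    int h_in[kBlockThreads], h_out[kBlockThreads];
    for (int i = 0; i < kBlockThreads; ++i) h_in[i] = i * 10;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    subset_sync_kernel<<<1, kBlockThreads>>>(d_in, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int t = 0; t < kBlockThreads; ++t) {
        const int src = (t < kSubsetThreads)
                            ? kSubsetThreads - 1 - t
                            : kBlockThreads - 1 - (t - kSubsetThreads);
        if (h_out[t] != h_in[src]) ++mismatches;
    }
    return mismatches;
}
```

Calling `run_subset_sync_demo()` from a `main` function and checking that it returns 0 is enough to try this out. The branch on `t < kSubsetThreads` is warp-uniform because the subset size is a multiple of 32, so whole warps either take part in the named barrier or skip it.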

0 Answers