Is it possible to synchronize only a subset of the warps in a CUDA block, i.e. with an effect somewhere between __syncwarp() and __syncthreads()?
The scenario is that all threads in the block execute a program like the following:
copy data from global memory to shared buffer
__syncthreads()
work in shared buffer
__syncthreads()
work in shared buffer
...
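
As a more concrete version of the outline above, here is a minimal CUDA sketch. The kernel name, block size, buffer names, and the arithmetic are all made up for illustration. The point is only that the first work phase reads just the elements copied by warps 0-2, while the elements copied by the other warps are first read in the second phase.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // hypothetical: 8 warps per block
constexpr int kEarly     = 96;   // elements copied by warps 0-2

// Hypothetical kernel illustrating the pattern from the question.
__global__ void twoPhaseKernel(const float* in, float* out)
{
    __shared__ float buf[kBlockSize];  // filled from global memory
    __shared__ float tmp[kEarly];      // result of the first work phase
    const int tid = threadIdx.x;

    // Copy data from global memory to the shared buffer (one element per thread).
    buf[tid] = in[blockIdx.x * kBlockSize + tid];
    __syncthreads();  // block-wide: also waits for the copies done by warps 3-7

    // Work phase 1: only reads elements 0..95, i.e. data copied by warps 0-2.
    if (tid < kEarly)
        tmp[tid] = 2.0f * buf[(tid + 32) % kEarly];
    __syncthreads();

    // Work phase 2: reads the whole buffer, including data copied by warps 3-7.
    out[blockIdx.x * kBlockSize + tid] = buf[(tid + 1) % kBlockSize] + tmp[tid % kEarly];
}

// Host-side driver: runs one block and compares against a CPU reference.
bool runExample()
{
    float hIn[kBlockSize], hOut[kBlockSize];
    for (int i = 0; i < kBlockSize; ++i) hIn[i] = static_cast<float>(i);

    const size_t bytes = kBlockSize * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    bool ok = cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        twoPhaseKernel<<<1, kBlockSize>>>(dIn, dOut);
        // The blocking copy back also surfaces kernel launch/execution errors.
        ok = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    if (!ok) return false;

    // CPU reference for the same arithmetic (values are small integers, so exact).
    for (int i = 0; i < kBlockSize; ++i) {
        const float expected = hIn[(i + 1) % kBlockSize]
                             + 2.0f * hIn[((i % kEarly) + 32) % kEarly];
        if (hOut[i] != expected) return false;
    }
    return true;
}
```

In this sketch the first __syncthreads() makes every warp wait for all 256 copies, even though phase 1 only depends on the first 96 elements.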
But after the first __syncthreads(), the data that was copied by warps 3-5 (for example) is not yet needed.
Is there a way to inform the compiler that there is no data dependency, so that the first __syncthreads() does not wait until these copies are finished and the latency of the global memory access can be hidden?