CUDA: Is it possible to synchronize threads based on threadIdx?

Question

As the title says, I wonder whether there is something like __syncthreads() whose barrier applies at sub-block level rather than block level, so that I can sync only the threads that share a particular threadIdx.x.

For instance, if I define a kernel launch as <<<1, (32, 32)>>>, is it possible to define something like __syncthreads(5) so that it syncs all threads having threadIdx.x == 5?

Following the documentation, it seems that CUDA does not define such a function; however, I wonder whether there exists some trick that can achieve the same result.

score 3 · Answer 1 · answered Jul 14 '22 at 15:29

Generally, no, this is not possible in CUDA. There is no built-in method to do this.

CUDA does provide __syncwarp() which allows synchronization of a warp (32 threads).

The CUDA cooperative groups mechanism does allow for synchronization "only" of subgroups of threads. But you do not have an arbitrary method to assign threads to groups.
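To illustrate the cooperative groups point, here is a minimal sketch, assuming a (32,32) block as in the question. It uses cg::tiled_partition<32> to get a group that can be synchronized on its own. The kernel name tile_sync_kernel, the helper run_tile_sync_demo and the scratch buffer are made up for this example. Note that the tiles are built from consecutive thread ranks, so you still do not get to pick which threads land in which group.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int kTile = 32;                 // tile size (one warp)
constexpr int kThreads = 32 * 32;         // threads per block

// Hypothetical kernel: each 32-thread tile synchronizes independently.
__global__ void tile_sync_kernel(int *out)
{
    __shared__ int scratch[kThreads];

    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<kTile> tile = cg::tiled_partition<kTile>(block);

    const int rank = static_cast<int>(block.thread_rank()); // x + y * 32 here
    const int lane = static_cast<int>(tile.thread_rank());  // 0..31 inside the tile
    const int base = rank - lane;                            // first rank of this tile

    scratch[rank] = rank;
    tile.sync();                          // barrier for this tile only

    // Safe to read a value written by another thread of the same tile.
    out[rank] = scratch[base + (lane + 1) % kTile];
}

// Hypothetical host helper: returns the number of mismatches, or -1 on a CUDA error.
int run_tile_sync_demo()
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, kThreads * sizeof(int)) != cudaSuccess) return -1;

    tile_sync_kernel<<<1, dim3(32, 32)>>>(d_out);

    int h_out[kThreads];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int rank = 0; rank < kThreads; ++rank) {
        const int lane = rank % kTile;
        const int expected = (rank - lane) + (lane + 1) % kTile;
        if (h_out[rank] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_tile_sync_demo());
    return 0;
}
```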

At the PTX level, there is more flexibility in the use of barriers. But you don't have the ability to assign an arbitrary set of threads to a barrier. (Instead, for example, arriving threads may simply be "counted").
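As a rough sketch of what "counted" means here, the following uses the PTX bar.sync instruction through inline assembly with a barrier id and an expected thread count. The wrapper named_barrier_sync, the kernel name and the choice of barrier id 1 and a count of 128 are assumptions made for this example. The count has to be a multiple of the warp size and the branch around the barrier is kept warp-uniform, so this synchronizes four whole warps, not an arbitrary set of threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 32 * 32;   // threads per block
constexpr int kSubset = 128;        // threads taking part in the named barrier (4 warps)

// Hypothetical wrapper around the PTX named barrier: the barrier completes
// once `count` threads have arrived at barrier `id`.
__device__ __forceinline__ void named_barrier_sync(unsigned id, unsigned count)
{
    asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(count) : "memory");
}

__global__ void named_barrier_kernel(int *out)
{
    __shared__ int scratch[kSubset];

    const int rank = threadIdx.x + threadIdx.y * blockDim.x;

    if (threadIdx.y < 4) {          // warp-uniform: warps 0..3, 128 threads
        scratch[rank] = rank;
        // Barrier 0 is the one __syncthreads() uses, so pick id 1 here.
        named_barrier_sync(1, kSubset);
        out[rank] = scratch[(rank + 1) % kSubset];
    } else {
        out[rank] = -1;             // the other 28 warps never touch the barrier
    }
}

// Hypothetical host helper: returns the number of mismatches, or -1 on a CUDA error.
int run_named_barrier_demo()
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, kThreads * sizeof(int)) != cudaSuccess) return -1;

    named_barrier_kernel<<<1, dim3(32, 32)>>>(d_out);

    int h_out[kThreads];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int rank = 0; rank < kThreads; ++rank) {
        const int expected = (rank < kSubset) ? (rank + 1) % kSubset : -1;
        if (h_out[rank] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_named_barrier_demo());
    return 0;
}
```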

My suggestion would be to use one of the above methods. For example, if you wanted to synchronize all threads with threadIdx.x == 5 in a (32,32) threadblock, that is 32 threads, the same size as a warp. (With the default layout, a warp is made of 32 consecutive threadIdx.x values, so those 32 threads sit in 32 different warps.) Rearrange your thread assignment pattern so that those 32 threads belong to the same warp, and use __syncwarp().
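One way to do that rearrangement, sketched under the assumption that the kernel is free to swap the roles of threadIdx.x and threadIdx.y: in a (32,32) block each warp consists of the 32 threads sharing one threadIdx.y, so treating threadIdx.y as the logical column puts every "column" in its own warp. The names column_sync_kernel, run_column_sync_demo, col and row are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kDim = 32;                  // block is kDim x kDim threads
constexpr int kThreads = kDim * kDim;

// Hypothetical kernel: the logical column index is mapped to threadIdx.y,
// so all threads of one logical column belong to the same warp.
__global__ void column_sync_kernel(int *out)
{
    __shared__ int scratch[kDim][kDim];

    const int col = threadIdx.y;          // logical "x": constant within a warp
    const int row = threadIdx.x;          // logical "y": lane index within the warp

    scratch[col][row] = col * 100 + row;
    __syncwarp();                         // syncs only the 32 threads sharing `col`

    // Safe to read a value written by another thread of the same column.
    out[col * kDim + row] = scratch[col][(row + 1) % kDim];
}

// Hypothetical host helper: returns the number of mismatches, or -1 on a CUDA error.
int run_column_sync_demo()
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, kThreads * sizeof(int)) != cudaSuccess) return -1;

    column_sync_kernel<<<1, dim3(kDim, kDim)>>>(d_out);

    int h_out[kThreads];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int col = 0; col < kDim; ++col) {
        for (int row = 0; row < kDim; ++row) {
            const int expected = col * 100 + (row + 1) % kDim;
            if (h_out[col * kDim + row] != expected) ++mismatches;
        }
    }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_column_sync_demo());
    return 0;
}
```

The trade-off is that any indexing that relied on threadIdx.x being the column (for example, coalesced global loads) has to be revisited after the swap.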

Or synchronize all 1024 threads ... – Sebastian, Jul 14 '22 at 17:02
