As the title says, I wonder whether it is possible to call something like __syncthreads()
where the barrier applies not to the whole block but to a subset of it, so that I can synchronize only the threads that share a particular threadIdx.x.
For instance, if I launch a kernel with <<<1, dim3(32, 32)>>>, is it possible to write something like __syncthreads(5)
so that it synchronizes only the threads with threadIdx.x == 5?
According to the documentation, CUDA does not seem to provide such a function; however, I wonder whether there is some trick that achieves the same result.
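
To make the setup concrete, here is a minimal sketch of the situation I have in mind. The names columnKernel and launchColumnKernel are made up for illustration, and the commented-out __syncthreads(5) call is the hypothetical function I am asking about; it does not exist in CUDA. The only barrier actually used is the ordinary block-wide __syncthreads().

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
__global__ void columnKernel(int *out)
{
    // Linear index within the block; threadIdx.x varies fastest.
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    out[tid] = threadIdx.x;

    // What I would like to write here (this overload does NOT exist):
    //     __syncthreads(5);   // barrier only for threads with threadIdx.x == 5
    // With blockDim.x == 32, those 32 threads sit in 32 different warps
    // (one per threadIdx.y), so a warp-level barrier does not cover this case.

    // The built-in barrier, which synchronizes all 1024 threads of the block:
    __syncthreads();
}

// Hypothetical host helper: launches one 32x32 block and copies the result back.
// hostOut must have room for 32 * 32 ints. Returns the first CUDA error seen.
cudaError_t launchColumnKernel(int *hostOut)
{
    const int n = 32 * 32;
    int *devOut = nullptr;
    cudaError_t err = cudaMalloc(&devOut, n * sizeof(int));
    if (err != cudaSuccess) return err;

    dim3 block(32, 32);                  // the "(32, 32)" block shape from above
    columnKernel<<<1, block>>>(devOut);
    err = cudaGetLastError();            // catches launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();   // waits for the kernel to finish
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(devOut);
    return err;
}
```

Calling launchColumnKernel with a 1024-element int array runs the kernel once; the comment inside the kernel marks the point where I would want the per-column barrier instead of the block-wide one.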