The CUDA C Programming Guide says:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
I tried to make the kernel hang with the following code:
#include <stdio.h>

__global__ void test(int warpSize)
{
    int i = threadIdx.x;
    if (i < warpSize) {
        __syncthreads();
    }
    else {
        __syncthreads();
    }
}

int main(int argc, char **argv)
{
    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    test<<<1, 2 * prop.warpSize>>>(prop.warpSize);
    printf("done");
    return 0;
}
But the program exited normally.
To my understanding, there are two barriers in the kernel. The barrier within the if-block will wait for warp #1 to arrive, and the barrier within the else-block will wait for warp #0 to arrive, so neither warp should ever get past its barrier. Did I misunderstand __syncthreads()? Or is a __syncthreads() in conditional code always executed, even if it is inside an 'inactive' execution path?
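One caveat about the test itself: kernel launches are asynchronous with respect to the host, so reaching printf("done") does not by itself show that the kernel ever finished. Below is a minimal sketch of the same experiment that waits for the kernel and reports any runtime error. It assumes a single CUDA device. The kernel parameter is renamed to `warp` (an illustrative name, not from the original code) so that it does not shadow the built-in warpSize variable. Since the guide describes the behavior as unintended rather than guaranteed, the outcome may well differ between GPU architectures.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same control flow as the kernel in the question. `warp` is an
// illustrative parameter name chosen to avoid shadowing the built-in
// warpSize variable.
__global__ void test(int warp)
{
    int i = threadIdx.x;
    if (i < warp) {
        __syncthreads();   // reached only by threads of the first warp
    } else {
        __syncthreads();   // reached only by threads of the second warp
    }
}

int main()
{
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    // One block containing exactly two warps.
    test<<<1, 2 * prop.warpSize>>>(prop.warpSize);

    // Catch launch-time errors such as an invalid configuration.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Block the host until the kernel has completed. If the kernel really
    // hangs, this call does not return (or returns an error if a display
    // watchdog terminates the kernel).
    err = cudaDeviceSynchronize();
    printf("kernel finished: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

With cudaDeviceSynchronize() in place, a normal exit means the kernel actually ran to completion, which makes the question about the two barriers easier to reason about.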