Is __syncthreads() in conditional code always run, even if it's inside an 'inactive' execution path?

Question

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

I tried to make the kernel hang with the following code:

#include <stdio.h>

__global__ void test(int warpSize)
{
    int i = threadIdx.x;
    if (i < warpSize) {
        __syncthreads();
    }
    else {
        __syncthreads();
    }
}

int main(int argc,char **argv)
{
    int device; 
    cudaDeviceProp prop; 
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    test<<<1, 2 * prop.warpSize>>>(prop.warpSize);

    printf("done");
    return 0;
}

But the program exited normally.

To my understanding, there are two barriers in the kernel. The barrier within the if-block will wait for warp #1 to arrive, and the barrier within the else-block will wait for warp #0 to arrive. Did I misunderstand __syncthreads()? Or is __syncthreads() in conditional code always run, even if it's inside an 'inactive' execution path?

It is very likely that the compiler is smarter than you and is optimising the entire kernel away. — talonmies, Dec 25 '16 at 16:51
Note that it doesn't say "it will always hang". You're exploring undefined behavior (UB). That means anything can happen, and it's difficult or impossible to explain the behavior. It may change with GPU, CUDA version, compiler version, or even run-to-run. That means even if someone gave you an explanation, it could change tomorrow. Therefore asking for explanations for UB may be unsatisfying. If you want a more detailed **discussion** of syncthreads behavior, you may want to read [this](http://stackoverflow.com/questions/6666382/can-i-use-syncthreads-after-having-dropped-threads). — Robert Crovella, Dec 25 '16 at 16:56
Since you have no CPU thread synchronization after your kernel call, even a hung kernel would not cause your program to hang; it would terminate "normally" regardless of kernel behavior. I'm not suggesting this is the explanation for what you are witnessing, so you don't need to come back and say "I added `cudaDeviceSynchronize()` but it still completes normally". I'm just pointing something out that you may want to be aware of if you go on a quest to make a kernel hang. — Robert Crovella, Dec 25 '16 at 17:01
Thank you @RobertCrovella. The [link](http://stackoverflow.com/questions/6666382/can-i-use-syncthreads-after-having-dropped-threads) is quite informative. — Jenny, Dec 25 '16 at 17:36

score 1 · Accepted Answer · answered Dec 25 '16 at 18:02

According to the comments, the kernel needs to do some observable work so that the compiler won't optimize it away. Also, the CPU thread is not blocked by a hung kernel unless the host synchronizes with the device after the launch.
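The second point can be illustrated with a minimal sketch. The `spin` kernel and the `launch_and_sync` helper are hypothetical names introduced only for this illustration. The kernel busy-waits for a while, the launch returns to the host at once, and only `cudaDeviceSynchronize()` makes the host wait:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly `cycles` GPU clock ticks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical helper: returns 0 if the kernel launched and ran to completion.
int launch_and_sync()
{
    spin<<<1, 1>>>(100000000LL);  // control returns to the host immediately

    // Errors in the launch itself (e.g. a bad configuration) are reported here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Non-blocking query: cudaErrorNotReady means work is still in flight.
    cudaError_t state = cudaStreamQuery(0);
    printf("right after launch: %s\n", cudaGetErrorString(state));

    // Blocks until the kernel finishes; a hung kernel would stall the host here.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main()
{
    return launch_and_sync();
}
```

The query right after the launch will most likely report that the device is not ready yet, although that depends on timing. Without the final `cudaDeviceSynchronize()` (or another blocking call such as `cudaMemcpy`), the host would reach the end of `main` no matter what the kernel does.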

Modified code:

#include <stdio.h>

__global__ void test(int warpSize, int *d_dummy)
{
    int i = threadIdx.x;
    __shared__ int tmp;
    tmp = 0;
    __syncthreads();

    if (i < warpSize) {
        tmp += 1;
        __syncthreads();
        tmp += 2;
    }
    else {
        tmp -= 3;
        __syncthreads();
        tmp -= 4;
    }
    __syncthreads();
    d_dummy[0] = tmp;
}

int main(int argc,char **argv)
{
    int device; 
    cudaDeviceProp prop; 
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    int h_dummy[1], *d_dummy;
    cudaMalloc(&d_dummy, 1 * sizeof(int));

    test<<<1, 2 * prop.warpSize>>>(prop.warpSize, d_dummy);
    cudaMemcpy(h_dummy, d_dummy, 1 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    printf("done %d", h_dummy[0]);
    return 0;
}

However, the behavior of __syncthreads() is undefined when warps within a block are not on the same execution path. So we cannot expect the program to hang.
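For contrast, here is a minimal sketch of a well-defined variant, assuming the goal is only to let the two warps do different work and then meet at a barrier. The kernel name `uniform_barrier`, the helper `run_uniform_barrier`, and the `partial` array are illustrative assumptions, not part of the original code. The divergent work stays inside the branches, while the single `__syncthreads()` sits outside them, so every thread of the block reaches the same barrier:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The branches diverge, but the barrier is outside the conditional,
// so all threads in the block execute the same __syncthreads().
__global__ void uniform_barrier(int warp_size, int *d_out)
{
    __shared__ int partial[2];
    int i = threadIdx.x;

    if (i == 0) {
        partial[0] = 1;          // written by one thread of the first warp
    }
    else if (i == warp_size) {
        partial[1] = 2;          // written by one thread of the second warp
    }

    __syncthreads();             // reached by every thread in the block

    if (i == 0) {
        d_out[0] = partial[0] + partial[1];
    }
}

// Hypothetical host helper: returns the kernel result, or -1 on any CUDA error.
int run_uniform_barrier()
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;

    uniform_barrier<<<1, 2 * prop.warpSize>>>(prop.warpSize, d_out);

    // cudaMemcpy blocks until the kernel is done and reports its errors.
    int h_out = -1;
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1;
}

int main()
{
    printf("result %d\n", run_uniform_barrier());
    return 0;
}
```

Because each shared-memory slot has exactly one writer and the read happens after the barrier, this version has no data race, and on a working device it should print `result 3`.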
