
In CUDA device code, the following if-else statement will cause divergence among the threads of a warp, resulting in two passes by the SIMD hardware. Assume Vs is a location in shared memory.

if (threadIdx.x % 2) {
  Vs[threadIdx.x] = 0;
} else {
  Vs[threadIdx.x] = 1;
}

I believe there will also be two passes when there is an if statement with no else branch. Why is this the case?

if (threadIdx.x % 2) {
  Vs[threadIdx.x] = 0;
}

Would the following if statement be completed in 3 passes?

if        (threadIdx.x < 10) {
  Vs[threadIdx.x] = 0;
} else if (threadIdx.x < 20) {
  Vs[threadIdx.x] = 1;
} else {
  Vs[threadIdx.x] = 2;
}
user2023370

1 Answer


On a GPU, an if-else statement may well take only one pass - a single predicated pass. The condition just sets a per-thread "do nothing" bit: during the "then" block it is on for the threads that failed the test, and during the "else" block it is on for the other half instead.

As @njuffa points out, however, this depends on factors such as the target architecture and the optimization level.
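To make the idea of a single predicated pass concrete, here is a minimal host-side C++ sketch that models one 32-lane warp running the if-else from the question. It is only an illustration: `run_predicated_if_else` and `kWarpSize` are hypothetical names, the loops stand in for lanes that the hardware would execute in lockstep, and real compiler output may look quite different.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kWarpSize = 32;  // assumed warp width

// Hypothetical host-side model, not a CUDA API: the whole warp is issued
// both blocks in one instruction stream, and a per-lane predicate bit
// decides which lanes actually commit each store.
std::array<int, kWarpSize> run_predicated_if_else() {
    std::array<int, kWarpSize> Vs{};  // stands in for the shared-memory array
    Vs.fill(-1);                      // -1 marks "not written yet"

    // Evaluate the condition once per lane and pack the results into a mask.
    std::uint32_t pred = 0;
    for (unsigned lane = 0; lane < kWarpSize; ++lane)
        if (lane % 2) pred |= (1u << lane);

    // "then" block: only lanes whose predicate bit is set store 0.
    for (unsigned lane = 0; lane < kWarpSize; ++lane)
        if (pred & (1u << lane)) Vs[lane] = 0;

    // "else" block: same stream, inverted predicate, store 1.
    for (unsigned lane = 0; lane < kWarpSize; ++lane)
        if (~pred & (1u << lane)) Vs[lane] = 1;

    // Sanity check: every lane was written by exactly one of the two blocks.
    for (int v : Vs) assert(v != -1);
    return Vs;
}
```

In this model no lane ever jumps; lanes with the predicate cleared simply sit out an instruction, which is why the two blocks need not count as two divergent passes.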

For more details, see:

Branch predication on GPU


For your first example, the if-else, a compiler might not even need a predicated pass, since the statement can be rewritten as

Vs[threadIdx.x] = (threadIdx.x % 2 ? 0 : 1);

and that's perfectly uniform across your warp. For your last example, it really depends: in theory the compiler could again optimize it into a single unpredicated pass, or you might get a single predicated pass, with a different predicate in each of the three scopes.
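As a sketch of how the last example could collapse into one unpredicated pass, the three-way branch can be written as plain arithmetic on the comparison results. This is host-side C++ with hypothetical function names, where `tid` stands in for `threadIdx.x`; it shows one possible rewrite, not what nvcc is known to emit.

```cpp
#include <cassert>

// Branching form, as in the question (tid stands in for threadIdx.x).
int three_way_branching(unsigned tid) {
    if (tid < 10) {
        return 0;
    } else if (tid < 20) {
        return 1;
    } else {
        return 2;
    }
}

// Branch-free form: each comparison yields 0 or 1, and their sum is the
// value to store, so every lane can run the same instructions.
int three_way_branch_free(unsigned tid) {
    return (tid >= 10) + (tid >= 20);
}

// Checks that both forms agree for every lane of a 32-thread warp.
void check_forms_agree() {
    for (unsigned tid = 0; tid < 32; ++tid) {
        assert(three_way_branching(tid) == three_way_branch_free(tid));
    }
}
```

Whether the compiler actually picks a form like this, predication, or real branches can only be settled by inspecting the generated code.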

einpoklum
  • The conditionally executed block of code may be predicated, or there may be a branch around it, or there may be a uniform branch plus predication. It is impossible to state up-front which choice the compiler will make, other than that the likelihood of predication increases with decreasing block size. Beyond that, it is up to the target architecture, optimization level, etc. Inspecting the machine code with `cuobjdump --dump-sass` will tell which choice the compiler made. – njuffa Feb 13 '16 at 21:27
  • Are you sure? The reference you provide is very brief, and it concerns an `if-else` (requiring 40+50 = 90 cycles) rather than the single `if` statement I'm asking about. – user2023370 Feb 14 '16 at 10:51
  • @user2023370: Well, I'm not _sure_ sure, but at least for the first example - if it doesn't happen in one pass, nvcc is incredibly lame. Seriously, though, just compile your kernel with `--ptx` passed to nvcc and you'll (sort of) see what actually happens. Don't worry, you don't have to know the specifics of PTX, you'll figure it out. – einpoklum Feb 14 '16 at 15:37