How to avoid executing both branches of conditional in CUDA program if it is known that the condition is the same for all threads in a warp?

Question

It is my understanding that if I have CUDA code of the form:

if (condition) {
    // do x
}
else {
    //do y
}

Then due to the SIMT execution of threads in a warp, the execution of the conditional will be serialized and all threads will be required to run both the x and y sections of the code. The exception to this is if the branches are big, in which case the compiler will insert a check using __any to avoid unnecessarily running code.

However, if I already know ahead of time that all threads in a warp will have the same value of condition, then this __any operation is unnecessary, merely serving to slow down my code.

I am wondering if there exists any way to instruct the compiler not to include this voting operation, but instead to assume that the evaluation of the condition is the same for all threads in the warp, and to run only the corresponding block of code?

I'm skeptical of the claim about the insertion of `__any` by the compiler. I've written complicated code on both sides of a conditional, and I don't see it in the SASS. I realize you can find [info on the web that seems to substantiate this](https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec3-2x2.pdf), but I think that is dated info, and an implementation detail that may not be true anymore. In any event there is no such [documented instruction to the compiler](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) and warp vote is a relatively low cost instruction. — Robert Crovella, Aug 31 '20 at 22:58
To the contrary, [NVIDIA materials](http://developer.download.nvidia.com/GTC/PDF/1083_Wang.pdf) (slide 49) suggest that if the decision boundary is outside of a warp, the cost of divergence is avoided (for that warp). My understanding is that the warp execution engine will only follow a divergent path if one or more threads in the warp have actually selected that path. If 0 threads in the warp have selected a particular divergent path, the warp execution engine is smart enough not to schedule that path (with all threads masked). — Robert Crovella, Aug 31 '20 at 23:08

talonmies · Accepted Answer · 2020-09-01T06:44:55.117

> Then due to the SIMT execution of threads in a warp, the execution of the conditional will be serialized and all threads will be required to run both the x and y sections of the code.

That only happens if the conditional doesn't evaluate uniformly within a warp.

> The exception to this is if the branches are big, in which case the compiler will insert a check using __any to avoid unnecessarily running code.

That is completely incorrect. The compiler doesn't do that, and it is trivial to disassemble literally any code emitted by any version of the CUDA compiler NVIDIA has ever released to confirm that. There is predicated execution, but that is significantly different from what you describe.

> However, if I already know ahead of time that all threads in a warp will have the same value of condition, then this __any operation is unnecessary, merely serving to slow down my code.

It isn't only unnecessary, it is nonexistent.

> I am wondering if there exists any way to instruct the compiler not to include this voting operation, but instead to assume that the evaluation of the condition is the same for all threads in the warp, and to run only the corresponding block of code?

No, because what you want is already the default behaviour.
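To illustrate the point about warp-uniform conditions, here is a minimal sketch, not taken from the original post. The kernel name `warp_uniform_branch`, the host helper `run_warp_uniform_branch`, and the launch sizes are all hypothetical. The condition depends only on the warp index, so every lane in a warp takes the same side of the branch. Each side records `__activemask()`, which on typical hardware should report all 32 lanes active when no divergence has occurred. The host side assumes a warp size of 32.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the condition depends only on the warp index,
// so every lane of a given warp takes the same side of the branch.
__global__ void warp_uniform_branch(unsigned int *masks, int *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;
    bool condition = (warp % 2 == 0);   // uniform within each warp

    if (condition) {
        masks[tid] = __activemask();    // lanes active while running "x"
        out[tid]   = 1;
    } else {
        masks[tid] = __activemask();    // lanes active while running "y"
        out[tid]   = 2;
    }
}

// Hypothetical host helper: returns true if every thread wrote the value
// for its own branch and observed a full 32-lane active mask there.
bool run_warp_uniform_branch()
{
    const int threads = 128, blocks = 4, n = threads * blocks;
    unsigned int *d_masks = nullptr;
    int *d_out = nullptr;
    if (cudaMalloc(&d_masks, n * sizeof(unsigned int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_masks);
        return false;
    }

    warp_uniform_branch<<<blocks, threads>>>(d_masks, d_out);
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = ok && (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<unsigned int> masks(n);
    std::vector<int> out(n);
    ok = ok && cudaMemcpy(masks.data(), d_masks, n * sizeof(unsigned int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(out.data(), d_out, n * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_masks);
    cudaFree(d_out);

    for (int i = 0; ok && i < n; ++i) {
        int warp = (i % threads) / 32;           // assumes warp size 32
        int expected = (warp % 2 == 0) ? 1 : 2;
        if (masks[i] != 0xffffffffu || out[i] != expected) ok = false;
    }
    return ok;
}
```

Calling `run_warp_uniform_branch()` from a `main` function and checking that it returns `true` is one way to exercise this. To look at the generated machine code directly, the file can be compiled with `nvcc -cubin` and the result inspected with `cuobjdump -sass`.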
