This is from the AMD OpenCL Optimization Guide:
Use predication rather than control flow. Predication allows the
GPU to execute both paths of execution in parallel, which can be
faster than attempting to minimize the work through clever
control flow. The reason for this is that if no memory operation
exists in a ?: operator (also called a ternary operator), this
operation is translated into a single cmov_logical instruction, which
is executed in a single cycle. An example of this is:
    if (A > B) { C += D; } else { C -= D; }
Replace this with:
    int factor = (A > B) ? 1 : -1; C += factor * D;
The first block of code translates into an IF/ELSE/ENDIF
sequence of conditional instructions, each taking ~8 cycles. If
divergent, this code executes in ~36 clocks; otherwise, in ~28
clocks. A branch not taken costs four cycles (one instruction slot);
a branch taken adds four slots of latency to fetch instructions from
the instruction cache, for a total of 16 clocks. Since the execution
mask is saved, then modified, then restored for the branch, ~12
clocks are added when divergent and ~8 clocks when not.
In the second block of code, the ?:
operator executes in the vector units, so no extra CF instructions
are generated. Since the instructions are sequentially dependent,
this block of code executes in 12 cycles, for a 1.3x speed
improvement. To see this: the first cycle is the (A > B) comparison,
whose result feeds the second cycle, the conditional move
(cmov_logical factor, bool, 1, -1); the final cycle is a MAD
instruction (mad C, factor, D, C).
If the ratio of conditional code to ALU instructions is low,
this is a good pattern for removing the control flow.
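
To make the comparison concrete, here is a minimal OpenCL kernel
sketch of the two blocks the guide compares. The kernel names, the
buffer layout, and the int element types are my assumptions for
illustration, not from the guide.

    // Control-flow version: compiles to an IF/ELSE/ENDIF clause sequence;
    // work-items in the same wavefront that take different paths diverge.
    __kernel void update_branchy(__global int *C, __global const int *A,
                                 __global const int *B, __global const int *D)
    {
        size_t i = get_global_id(0);
        if (A[i] > B[i]) {
            C[i] += D[i];
        } else {
            C[i] -= D[i];
        }
    }

    // Predicated version: the ?: becomes a conditional move, so the compare,
    // select, and multiply-add stay in the vector units with no extra CF
    // instructions generated.
    __kernel void update_predicated(__global int *C, __global const int *A,
                                    __global const int *B, __global const int *D)
    {
        size_t i = get_global_id(0);
        int factor = (A[i] > B[i]) ? 1 : -1;
        C[i] += factor * D[i];
    }

Note the guide's own caveat applies: this pays off only when the
conditional work is cheap ALU code with no memory operations, since
the predicated form effectively evaluates both sides.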
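A related sketch, not from the guide: for vector types the same
pattern can be written with OpenCL's built-in select(), which also
avoids control flow. The kernel name and the int4 element type are
assumptions.

    __kernel void update_select(__global int4 *C, __global const int4 *A,
                                __global const int4 *B, __global const int4 *D)
    {
        size_t i = get_global_id(0);
        // For vector operands, A[i] > B[i] yields -1 in each true lane and 0
        // in each false lane; select(a, b, cond) returns b where the most
        // significant bit of cond is set and a elsewhere.
        int4 factor = select((int4)(-1), (int4)(1), A[i] > B[i]);
        C[i] += factor * D[i];
    }
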
Seems like your 0/1 selection could be based on predication. Not sure if this is what you are looking for.
Source: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/