
I have a question about branch predication on GPUs. As far as I know, GPUs handle branches using predication.

For example, I have code like this:

if (C)
 A
else
 B
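
To make it concrete, in CUDA the branch might look something like this (just a sketch with made-up names; the arithmetic is only a stand-in for the real A and B):

    // Hypothetical kernel: c[tid] varies between threads of the same warp,
    // so the warp diverges and both paths have to be handled.
    __global__ void divergent(const int *c, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float x = out[tid];

        if (c[tid])
            x = x * 2.0f + 1.0f;   // stand-in for path A (~40 cycles)
        else
            x = x * 3.0f - 1.0f;   // stand-in for path B (~50 cycles)

        out[tid] = x;
    }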

So if A takes 40 cycles and B takes 50 cycles to finish execution, and assuming that for one warp both A and B are executed, does it take 90 cycles in total to finish this branch? Or do they overlap A and B, i.e. execute some instructions of A, wait for the memory request, then execute some instructions of B, wait for memory, and so on? Thanks

Zk1001
  • For those who are tempted to edit this question further, please note that branch predication and branch prediction are **not** the same thing... – talonmies Aug 03 '11 at 15:39
  • I've found a good description here: http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html – isti_spl Dec 07 '13 at 20:09

1 Answer


All of the CUDA-capable architectures released so far operate like a SIMD machine. When there is branch divergence within a warp, both code paths are executed by all the threads in the warp, with the threads that are not following the active path executing the functional equivalent of a NOP (I think I recall that there is a conditional execution flag attached to each thread in a warp which allows non-executing threads to be masked off).

So in your example, the 90 cycles answer is probably a better approximation of what really happens than the alternative.
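
If you want to see this for yourself, you can bracket the branch with the clock() intrinsic and launch a single warp with a divergent condition. This is only a sketch (the loops are arbitrary stand-ins for A and B, and the compiler is free to move the clock() reads, so treat the numbers as indicative at best):

    // Sketch: time a divergent branch from inside the kernel with clock().
    // Launch with one block of 32 threads so a single warp diverges.
    __global__ void timeBranch(const int *cond, float *data, int *cycles)
    {
        int tid = threadIdx.x;
        float x = data[tid];

        clock_t start = clock();
        if (cond[tid])                 // divergent if cond varies within the warp
            for (int i = 0; i < 40; ++i) x = x * 1.0001f + 1.0f;   // "A"
        else
            for (int i = 0; i < 50; ++i) x = x * 1.0002f - 1.0f;   // "B"
        clock_t stop = clock();

        data[tid] = x;                 // keep the arithmetic from being optimised away
        if (tid == 0)
            *cycles = (int)(stop - start);
    }

Running the same kernel with a warp-uniform condition should give a noticeably smaller count, which is the serialisation described above.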

talonmies
  • As I remember, there is a branch instruction in CUDA asm, but the branch will be taken by all threads of the warp. – osgx Jul 05 '11 at 12:11
  • Per ptx_isa.pdf: "If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path." So there is a conditional branch in PTX, but all threads of a warp must take or not take this branch at the same time (i.e. be uniform) to get full performance. – osgx Jul 05 '11 at 12:15
  • Thanks talonmies and osgx. So in the above case it would be 90 cycles to execute the code. But I wonder why they don't implement the alternative? I mean, overlap A and B so the throughput is higher. In that case they would still operate in the SIMD manner, but the latency could be hidden better (by executing the other path of the branch)? – Zk1001 Jul 05 '11 at 12:37
  • It is most probably a compromise. The more sophisticated you make branch predication and conditional execution, the more transistors are required to implement it. GPUs are designed to run code which doesn't have a lot of branches, and in that scenario it makes more sense to spend transistors on the parts of the GPU that do computation, and fewer on the parts that reduce latency (instruction scheduling, cache, etc.). – talonmies Jul 05 '11 at 12:43
  • Ok. That makes a lot of sense. Thanks for clarification! – Zk1001 Jul 05 '11 at 12:49
  • I believe that at the hardware level, divergence is actually realized at the *half*-warp level. Though there are several places nVidia said that this is an implementation detail not covered by the spec and may change at any time ... *without warning*. – M. Tibbits Jul 05 '11 at 18:48
  • That is incorrect: conditional execution is handled per warp, not per half-warp. It is also worth pointing out that if the branch condition is not divergent within a warp (for example `if (threadIdx.x > 64)`), then there is no divergent execution. – harrism Jul 06 '11 at 05:42
  • I think harrism is right. Only memory transactions are processed per half-warp; conditional branches are processed per warp. But the point I want to make here is that when there is divergence within a warp, the threads, while the warp is executing the path they don't take, just sleep (or perform null ops, or are disabled, whatever). – Zk1001 Jul 06 '11 at 05:54
  • Memory transactions are also processed per warp on the Fermi architecture (the current architecture). The older GT200 (aka the Tesla architecture) processed memory transactions per half-warp. Half-warps are not used anywhere on Fermi. – harrism Aug 03 '11 at 07:13