
When a wavefront executes, it provides fine-grained multithreading. One consequence of this is that there is no branch prediction requirement, as stated in the following slide:

[slide image]

But I am unable to understand this. Can someone explain this in a simple way?

user25108

3 Answers


Branches introduce significant latency into the execution of a stream of instructions. If the processor does not support speculative execution, then no instructions after the branch are allowed to execute until the branch condition has been evaluated. If the branch is taken, the processor must fetch the new instruction line, introducing additional latency; if the branch is not taken, execution can continue. On deep pipelines, evaluating the condition can cost 10-20 cycles. Branch prediction and speculative execution allow the processor to continue executing additional instructions, or to start fetching early in case the branch is taken. If the prediction is incorrect, all instructions following the branch have to be thrown out (rolled back).

Branch prediction hardware is usually expensive in terms of area but even basic branch prediction (likely taken vs. likely not taken) can significantly improve IPC.

GPUs do not tend to implement branch prediction for at least 3 reasons:

  1. Branch prediction's goal is to improve IPC by speculatively executing instructions instead of waiting for both the result of the conditional and the possible additional instruction fetch. GPUs are designed to hide latency by switching between multiple threads of execution at essentially no cost. While a warp/wavefront is waiting for the result of its branch conditional, instructions from other warps/wavefronts can be issued to hide the latency.

  2. Branch history tables are expensive in terms of area.

  3. Speculative execution is expensive in terms of area.

Greg Smith

The slide says that only one instruction is in the pipeline at any time. The purpose of branch prediction is to keep the instruction pipeline from filling up with the wrong branch (loading the "if" part only to realize it should have loaded the "else" instructions instead). This is not needed if only one instruction is in the pipeline, because you never make the investment of filling your x pipeline stages (a quick Google search: up to 30 for CPUs) before realizing it was the wrong branch, having to flush the pipeline, and starting all over again.

Trudbert

Some details depend on the actual GPU architecture. But here is a simplified example, in addition to the answer that Trudbert already gave (+1):

For a branch like this

if (data[threadIndex] > 0.5) {
    data[threadIndex] = 1.0;
}

there may be a set of threads for which the statement is true, and another set of threads for which the statement is false. One can imagine it as if the threads for which the statement is false simply wait until the others have finished their work.

Analogously, for a branch like this

if (data[threadIndex] > 0.5) {
    data[threadIndex] = 1.0;
} else {
    data[threadIndex] = 0.0;
}

one can imagine this as all threads executing both paths of the branch, and making sure that the results from the "wrong" path are ignored. This is referred to as "predicated execution".

(More detailed information about this can be found in GPU Gems 2, Chapter 34)

So since there is no advantage in predicting the "right" branch (because every thread has to take all branches anyway), there is no reason to introduce branch prediction.

Marco13