
I believe that when designing CPUs, branch misprediction is a major slowdown, since the wrong branch gets chosen. So why don't CPU designers simply execute both branches, then cut one off once they know for sure which one was actually taken?

I realize that this could only go 2 or 3 branches deep within a short number of instructions or the number of parallel stages would get ridiculously large, so at some point you would still need some branch prediction since you definitely will run across larger branches, but wouldn't a couple stages like this make sense? Seems to me like it would significantly speed things up and be worth a little added complexity.

Even just a single branch deep would almost halve the time eaten up by wrong branches, right?

Or maybe it is already somewhat done like this? Branches usually only choose between two choices when you get down to assembly, correct?

mczarnek
    Even for just one level, you find yourself needing twice as much pipeline hardware (at least), which burns twice as much energy when active. Modern CPUs seek to minimise energy use (as heat dissipation is usually the bottleneck). – Oliver Charlesworth Oct 19 '14 at 20:20
  • Excellent point.. so maybe that immediately cuts off the possibility of more than one level and is definitely a concern. But the Intel i7 has an 88 W TDP, whereas AMD's newest 5 GHz processors are currently running at a 220 W TDP. So clearly it is possible to dissipate that much heat and have a chip that can handle it. I could see one step being huge. And I think I was wrong about the double speed up for one level.. if branch prediction is right 99% of the time, then even when it's wrong, odds are the next level will be right.. so it'd likely be much more than twice the speed up. – mczarnek Oct 19 '14 at 20:53
  • The common academic term for this is "eager execution". (A [Google Scholar search](http://scholar.google.com/scholar?hl=en&q="eager+execution") will give some academic studies.) A more limited technique is dynamic hammock predication, which can use predictor confidence information to choose whether to predicate or use the prediction. –  Oct 20 '14 at 21:49

1 Answer


You're right to be afraid of exponentially filling the machine, but you underestimate how fast that happens. A common rule of thumb says you can expect ~20% branches on average in your dynamic code, i.e. one branch in every 5 instructions. Most CPUs today have a deep out-of-order core that fetches and executes hundreds of instructions ahead. Take Intel's Haswell for example: it has a 192-entry ROB, meaning you could hold at most 4 levels of branches (at that point you'd have 16 "fronts" and 31 "blocks", each ending in a single bifurcating branch; assuming each block holds 5 instructions, you've almost filled the ROB, and another level would exceed it). At that point you would have progressed only to an effective depth of ~20 instructions, rendering any instruction-level parallelism useless.
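The exponential fill described above can be checked with a little arithmetic (the 5-instructions-per-block and 192-entry figures are the ones quoted in this answer; the script is just an illustrative sketch):

```python
# Eager execution fills the ROB exponentially with branch depth.
# Assumptions from the answer: ~1 branch per 5 instructions,
# and a 192-entry ROB (Haswell).
INSNS_PER_BLOCK = 5
ROB_ENTRIES = 192

for depth in range(1, 6):
    leaves = 2 ** depth                # parallel "fronts" being executed
    blocks = 2 ** (depth + 1) - 1      # full binary tree of basic blocks
    insns = blocks * INSNS_PER_BLOCK   # ROB entries consumed
    status = "fits" if insns <= ROB_ENTRIES else "exceeds ROB"
    print(f"depth={depth}: {leaves} fronts, {blocks} blocks, "
          f"{insns} instructions -> {status}")
```

At depth 4 you get 16 fronts, 31 blocks and 155 instructions, which fits in 192 entries; depth 5 needs 315 entries and blows past the ROB.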

If you want to diverge over 3 levels of branches, it means you're going to have 8 parallel contexts, each with only 24 entries available to run ahead. And that's ignoring the overhead of rolling back 7/8 of your work, the need to duplicate all state-saving HW (like registers, of which you have dozens), and the need to split other resources into 8 parts like you did with the ROB. It's also not counting memory management, which would have to handle complicated versioning, forwarding, coherency, etc.

Forget about power consumption, even if you could support that wasteful parallelism, spreading your resources that thin would literally choke you before you could advance more than a few instructions on each path.

Now, let's examine the more reasonable option of splitting over a single branch - this is beginning to look like Hyperthreading: you split/share your core resources over 2 contexts. This feature has some performance benefits, granted, but only because both contexts are non-speculative. As it is, I believe the common estimate is around 10-30% over running the 2 contexts one after the other, depending on the workload combination (numbers from an AnandTech review) - that's nice if you indeed intended to run both tasks anyway, but not when you're about to throw away the results of one of them. Even if you ignore the mode-switch overhead here, you're gaining 30% only to lose 50% - no sense in that.
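To make the "gaining 30% only to lose 50%" point concrete, here's the back-of-the-envelope version (the ~30% SMT gain is the figure quoted above; everything else follows from it):

```python
# Cost of eagerly running both sides of a branch on two SMT contexts,
# versus predicting and running only the correct side.
# Assumption from the answer: SMT gives ~30% more throughput than
# running the two contexts back to back.
correct_path_only = 1.0          # time units to execute just the taken side
both_paths_serial = 2.0          # run taken + not-taken one after the other
smt_speedup = 1.30               # SMT gain over serial, per the quoted figure

both_paths_smt = both_paths_serial / smt_speedup
print(round(both_paths_smt, 2))  # ~1.54 time units
```

So even with the SMT bonus, eagerly executing both paths takes roughly 1.54 time units versus 1.0 for just running the correctly predicted path - you're still ~54% slower, before counting any of the duplication overheads.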

On the other hand, you have the option of predicting the branches (modern predictors can reach over 95% success rate on average), and paying the penalty of misprediction, which is already partially hidden by the out-of-order engine (some instructions preceding the branch may execute after it's cleared; most OOO machines support that). This leaves any deep out-of-order engine free to roam ahead, speculating up to its full potential depth, and being right most of the time. The odds of flushing some of the work here do decrease geometrically (95% after the first branch, ~90% after the second, etc.), but the flush penalty also decreases. It's still far better than a global efficiency of 1/n (for n levels of bifurcation).
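The geometric decay on both sides can be tabulated (the 95% per-branch accuracy is the figure used above; note that eager execution keeps only 1 of 2^n paths, so its useful fraction halves with every level):

```python
# Compare the odds of speculated work being useful (predictor right on
# every branch so far, p = 0.95 per branch) against the fraction of
# eager-execution work that turns out useful (1 of 2**n paths survives).
P = 0.95  # per-branch prediction accuracy, as assumed in the answer

for n in range(1, 6):
    speculate_useful = P ** n      # all n predictions correct
    eager_useful = 1 / 2 ** n      # only one path out of 2**n is kept
    print(f"{n} levels: speculation {speculate_useful:.3f} "
          f"vs eager {eager_useful:.3f}")
```

After 4 levels, speculation is still right about 81% of the time, while eager execution is by construction wasting 15/16 of its work.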

Leeor
  • It should be noted that predication is a software mechanism for removing the need to predict branches. (Even just conditional move is enough to effectively predicate code that never generates exceptions.) IBM somewhat recently implemented single instruction *dynamic*, limited (and selective) predication in POWER7. –  Oct 20 '14 at 21:36
  • @PaulA.Clayton, true, but I would hazard a guess that predication doesn't work well with general purpose CPUs. It's more suited to dataflow architectures, which in turn tend to be very specialized (or otherwise - very bad) – Leeor Oct 20 '14 at 22:21
  • Predication adds an additional source operand with the expected implications for readiness checking in the issue queue of an out-of-order processor. This also introduces a dataflow dependency; with branch prediction operations that only have control flow dependencies on a not yet available value can be speculatively executed. (Of course, then one could have predicate prediction ...). –  Oct 20 '14 at 22:58