
I was recently thinking about branch prediction in modern CPUs. As far as I understand, branch prediction is necessary because, when executing instructions in a pipeline, the result of the conditional test is not yet known at the point where the CPU has to decide what to fetch after the branch.

Since modern out-of-order CPUs can execute instructions in any order, as long as the data dependencies between them are met, my question is: can a CPU reorder instructions in such a way that the branch condition is already resolved by the time the branch has to be taken, so that it can "anticipate" the branch direction and doesn't need to guess at all?

So can the CPU turn this:

do_some_work();
if(condition()) //evaluating here requires the cpu to guess the direction or stall
   do_this();
else
   do_that();

To this:

bool result = condition();
do_some_work(); //bunch of instructions that take longer than the pipeline length
if(result) //value of result is known, thus decision is always 100% correct
   do_this();
else
   do_that();

A particular and very common use case would be iterating over collections, where the exit condition is often loop-invariant (since we usually don't modify the collection while iterating over it).
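
For example (a minimal sketch; `sum` and `items` are just made-up names), the exit test below depends only on values that are fixed before the loop starts:

#include <cstddef>
#include <vector>

int sum(const std::vector<int>& items) {
    int total = 0;
    const std::size_t n = items.size(); // loop-invariant: the collection isn't modified below
    for (std::size_t i = 0; i < n; ++i) // so the outcome of every compare is knowable well in advance
        total += items[i];
    return total;
}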

My question is: can modern CPUs generally do this, and if so, which particular CPU cores are known to have this feature?

torgabor
  • It's related more to compiler optimization rather than CPU. – Eugene Sh. Jun 30 '15 at 16:03
  • Well, yeah, the compiler can reorganize the code to be like the second example, but the CPU still has to know that the `result` variable contains the branch direction, and act accordingly. – torgabor Jun 30 '15 at 16:09
  • I think that could be tested with a profiler with branch prediction miss rate. – edmz Jun 30 '15 at 16:11
  • And this is exactly what the "branch prediction" is doing. – Eugene Sh. Jun 30 '15 at 16:12
  • I believe branch prediction is a bit lower level and "dumb" in that it doesn't get to know much about the state of the program, only the basic feeding of instructions. So no, I don't believe CPUs do this. – Cory Nelson Jun 30 '15 at 16:17
  • As far as I understand, branch prediction depends quite a lot on the compiler organizing the code in a way that makes branches easy to predict. – Some programmer dude Jun 30 '15 at 16:21
  • You might find Rami Sheikh et al.'s "Control-Flow Decoupling" (2012, [ACM page](http://dl.acm.org/citation.cfm?id=2457509); [PDF](http://www4.ncsu.edu/~rmalshei/i/micro2012.pdf)) interesting. – Jun 30 '15 at 18:38
  • Hmm, no, sounds like you are expecting a CPU core to solve the halting problem. It's been done: RISC cores used to have a "branch delay slot", an extra instruction that would always be executed after a branch to buy a delay. It scales like crap, which is a big reason you don't have a RISC core in your machine today. – Hans Passant Jun 30 '15 at 21:52
  • For loops, it almost always makes sense to predict the path will go inside the loop. – Mike Dunlavey Jun 30 '15 at 22:44
  • This is a quite interesting approach. Anyway, I think that collection traversal is not a good example as defaulting to the "loop" branch always costs a single misprediction whatever the number of iterations. –  Jul 01 '15 at 09:38
  • Related / possible duplicate: [Avoid stalling pipeline by calculating conditional early](//stackoverflow.com/q/49932119) – Peter Cordes Apr 01 '19 at 01:40

2 Answers


Keep in mind that branch prediction happens so early in the pipeline that the instruction hasn't even been decoded yet, so you can't resolve the data dependency because you don't yet know which register is used. You may be able to remember that information from earlier encounters, but that's not 100% reliable (since your storage capacity/time is limited), and it's pretty much what your normal branch predictor already does: speculate the target based on the instruction pointer alone.
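
To illustrate the idea (just a toy sketch, nothing like a real core's predictor), the classic scheme is a table of 2-bit saturating counters indexed by a few bits of the branch's address, so the direction prediction is made from the instruction pointer alone, long before the condition's registers are even known:

#include <array>
#include <cstdint>

// Toy direction predictor: 1024 two-bit saturating counters indexed by PC bits only.
struct ToyPredictor {
    std::array<std::uint8_t, 1024> counters{};    // each counter holds 0..3

    bool predict(std::uint64_t pc) const {        // guess "taken" if the counter is 2 or 3
        return counters[(pc >> 2) & 1023] >= 2;
    }
    void update(std::uint64_t pc, bool taken) {   // train only after the branch actually resolves
        std::uint8_t& c = counters[(pc >> 2) & 1023];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
    }
};

Real predictors are far more elaborate (history-based, plus a separate buffer for the branch target), but the point stands: nothing about the condition's data is consulted at prediction time.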

However, pulling the condition evaluation earlier is useful. It has been done in the past, and it is mostly a compiler technique, though it may be enhanced with some HW support (e.g. hoisting the branch condition). The main performance impact of a branch misprediction is the delay in evaluating the condition, though, since the branch recovery itself is pretty short these days.

This means that you can mitigate most of the penalty with a compiler that only hoists the condition and calculates it earlier, without any HW modification. You still pay the penalty of a flush in case the branch was mispredicted (and the odds of that are usually low with contemporary predictors), but you'll know the outcome immediately upon decoding the branch itself (since the data will already be ready), so the damage is limited to the very few instructions that made it down the pipe past that branch.

Being able to hoist the evaluation isn't simple, though. The compiler can detect whether there are direct data dependencies on the earlier code (do_some_work() in your example) in most cases, but in many cases there will be some. Loop invariants are among the first things the compiler already moves out today. In addition, some of the hardest-to-predict branches depend on a memory fetch, and you usually can't assume memory stays unchanged across the hoisted region (you can, with some special checks afterwards, but most common compilers don't do that). Either way, it's still a compiler technique and not a fundamental change in branch prediction.
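
For illustration, here is a minimal sketch of that limitation (the function names are placeholders, as in your example): because do_some_work() is opaque to the compiler, it has to assume the call might write to the memory the condition reads, so it can't simply hoist the load above the call:

void do_some_work();
void do_this();
void do_that();

void f(int* flag) {
    do_some_work();  // may store through a pointer the compiler can't prove is distinct from flag
    if (*flag)       // so this load generally can't be moved above the call without extra checks
        do_this();
    else
        do_that();
}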

Leeor

Branch prediction is done because the CPU's instruction fetcher needs to know which instructions to fetch after a branch instruction, and this is not known until after the branch executes.

If a processor has a 5-stage pipeline (most processors have more stages) like this:

  1. Instruction fetch
  2. Instruction decode
  3. Register read
  4. ALU execution
  5. Register write back

the fetcher will stall for 3 cycles because the branch result won't be known until after the ALU execution cycle.
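
Roughly, the timing looks like this (a simplified illustration, assuming no prediction and that the branch resolves at the end of the ALU stage):

cycle:             1    2    3    4    5
branch:            IF   ID   RR   ALU  WB
next instruction:       --   --   --   IF    <- cycles 2-4 are bubbles: a 3-cycle stall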

Hoisting the branch test condition does not address the latency from fetching a branch instruction to its execution.

Craig S. Anderson