Questions tagged [branch-prediction]

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

"Why is it faster to process a sorted array than an unsorted array?", Stack Overflow's highest-voted question and answer, is a good introduction to the subject.


Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "not taken" and continue execution with the first branch of code, which follows immediately after the conditional jump, or it can be "taken" and jump to a different place in program memory where the second branch of code is stored.

It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline.

Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
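As a rough illustration of what "guessing" can mean, the sketch below simulates a 2-bit saturating counter. This is a classic textbook scheme, not something the text above specifies, and real predictors are far more elaborate. The function names and the 'T'/'N' outcome strings are made up for this example.

```c
#include <stddef.h>

/* Counter states: 0 = strongly not-taken, 1 = weakly not-taken,
   2 = weakly taken,       3 = strongly taken. */
int predict_taken(unsigned counter)
{
    return counter >= 2;
}

/* Move one step toward the actual outcome, saturating at 0 and 3. */
unsigned update_counter(unsigned counter, int taken)
{
    if (taken)
        return counter < 3 ? counter + 1 : 3;
    return counter > 0 ? counter - 1 : 0;
}

/* outcomes is a string of 'T' (taken) and 'N' (not taken) for one branch.
   Returns how many of them the counter mispredicts, starting from the
   given initial state. */
size_t count_mispredictions(const char *outcomes, unsigned counter)
{
    size_t misses = 0;
    for (const char *p = outcomes; *p != '\0'; ++p) {
        int taken = (*p == 'T');
        if (predict_taken(counter) != taken)
            ++misses;
        counter = update_counter(counter, taken);
    }
    return misses;
}
```

With this model, a branch that is almost always taken costs one miss per exception, while a strictly alternating branch starting from a weak state is mispredicted every time.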

The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the greater the need for a good branch predictor.
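To get a feel for why the penalty matters, here is a back-of-the-envelope model (a common simplification, not a claim about any particular CPU): average cycles per instruction is the base CPI plus the fraction of instructions that are branches, times the misprediction rate, times the penalty. The function name and all the example numbers are assumptions for illustration only.

```c
/* Simplified cost model: every mispredicted branch adds a fixed
   penalty, and nothing else overlaps with or hides that penalty. */
double effective_cpi(double base_cpi,
                     double branch_fraction,   /* branches per instruction */
                     double mispredict_rate,   /* misses per branch */
                     double penalty_cycles)    /* cycles lost per miss */
{
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles;
}
```

For example, with a base CPI of 1.0, one branch in five instructions, a 5% misprediction rate and a 15-cycle penalty, the model gives about 1.15 cycles per instruction, i.e. roughly 15% slower than with perfect prediction.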

Source: http://en.wikipedia.org/wiki/Branch_predictor


The Spectre security vulnerability revolves around branch prediction.


Other resources

Special-purpose predictors: the Return Address Stack for call/ret. A ret is effectively an indirect branch that sets the program counter to the return address. That would be hard to predict on its own, but calls are normally made with a dedicated instruction, so modern CPUs can match call/ret pairs using an internal stack.
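The call/ret matching idea can be sketched as a small software model. This is only an illustration: the depth of 16, the overwrite-oldest behaviour on overflow, and all names here are assumptions, and real hardware differs in the details.

```c
#include <stddef.h>
#include <stdint.h>

#define RAS_DEPTH 16  /* hypothetical depth; real sizes vary by CPU */

typedef struct {
    uint64_t slot[RAS_DEPTH];
    size_t top;    /* index of the next free slot (circular) */
    size_t valid;  /* number of usable entries, at most RAS_DEPTH */
} ReturnStack;

/* On a call: remember the address of the instruction after the call.
   When the stack is full, the oldest entry is silently overwritten. */
void ras_on_call(ReturnStack *ras, uint64_t return_address)
{
    ras->slot[ras->top] = return_address;
    ras->top = (ras->top + 1) % RAS_DEPTH;
    if (ras->valid < RAS_DEPTH)
        ras->valid++;
}

/* On a ret: pop and return the predicted target.
   Returns 0 when the stack is empty, meaning "no prediction". */
uint64_t ras_predict_ret(ReturnStack *ras)
{
    if (ras->valid == 0)
        return 0;
    ras->top = (ras->top + RAS_DEPTH - 1) % RAS_DEPTH;
    ras->valid--;
    return ras->slot[ras->top];
}
```

Nested calls pop in reverse order, so each ret is predicted correctly as long as calls and returns stay paired and the nesting is no deeper than the stack.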

Computer architecture details about branch prediction / speculative execution, and its effects on pipelined CPUs

  • Why is it faster to process a sorted array than an unsorted array?
  • Branch prediction - Dan Luu's article on branch prediction, adapted from a talk. With diagrams. Good introduction to why it's needed, and some basic implementations used in early CPUs, building up to more complicated predictors. And at the end, a link to TAGE branch predictors used on modern Intel CPUs. (Too complicated for that article to explain, though!)
  • Slow jmp-instruction - even unconditional direct jumps (like x86's jmp) need to be predicted, to avoid stalls in the very first stage of the pipeline: fetching blocks of machine code from I-cache. After fetching one block, you need to know which block to fetch next, before (or at best in parallel with) decoding the block you just fetched. A large sequence of jmp next_instruction will overwhelm branch prediction and expose the cost of misprediction in this part of the pipeline. (Many high-end modern CPUs have a queue after fetch before decode, to hide bubbles, so some blocks of non-branchy code can allow the queue to refill.)
  • Branch target prediction in conjunction with branch prediction?
  • What branch misprediction does the Branch Target Buffer detect?

Cost of a branch miss


Modern TAGE predictors (in Intel CPUs, for example) can "learn" amazingly long patterns, because they index based on past branch history. The same branch can therefore get different predictions depending on the path leading up to it, and a single branch can have its prediction data scattered over many bits in the branch predictor table. This goes a long way toward solving the problem of indirect branches in an interpreter almost always mispredicting (see X86 prefetching optimizations: "computed goto" threaded code, and Branch prediction and the performance of interpreters - Don't trust folklore). It also means that, for example, a binary search on the same data with the same input can be really efficient.
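To see why indexing by history helps, here is a deliberately tiny model. It is not TAGE: it is a single table of 2-bit counters indexed only by the last few outcomes (a real gshare-style predictor would also hash in the branch address, and TAGE uses several tagged tables with different history lengths). The 4-bit history length and the function name are assumptions for the example.

```c
#include <stddef.h>

#define HISTORY_BITS 4
#define TABLE_SIZE (1u << HISTORY_BITS)

/* outcomes is a string of 'T'/'N' for one branch. Returns the number of
   mispredictions from position `warmup` onward, so the training period
   can be excluded from the count. */
size_t history_predictor_misses(const char *outcomes, size_t warmup)
{
    unsigned char counter[TABLE_SIZE];
    unsigned history = 0;   /* last HISTORY_BITS outcomes, newest in bit 0 */
    size_t misses = 0;

    for (size_t i = 0; i < TABLE_SIZE; ++i)
        counter[i] = 1;     /* start every entry at "weakly not-taken" */

    for (size_t i = 0; outcomes[i] != '\0'; ++i) {
        int taken = (outcomes[i] == 'T');
        unsigned char *c = &counter[history];

        if (i >= warmup && (*c >= 2) != taken)
            ++misses;

        /* Train the 2-bit counter selected by the current history. */
        if (taken) {
            if (*c < 3) ++*c;
        } else {
            if (*c > 0) --*c;
        }
        history = ((history << 1) | (unsigned)taken) & (TABLE_SIZE - 1);
    }
    return misses;
}
```

On a repeating T T T N pattern, each distinct history ends up with its own counter, so once trained this model predicts the periodic N as well. A single 2-bit counter without history would keep missing that N once per group.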

Static branch prediction on newer Intel processors - according to experimental evidence, it appears Nehalem and earlier do sometimes use static prediction at some point in the pipeline (backwards branches default to predicted-taken, forward to not-taken.) But Sandybridge and newer seem to be always dynamic based on some history, whether it's from this branch or one that aliases it. Why did Intel change the static branch prediction mechanism over these years?

Cases where TAGE does "amazingly" well


Assembly code layout: this matters not so much for branch prediction itself as because not-taken branches are easier on the front-end than taken branches. I-cache code density is better if the fast path is just a straight line, and a taken branch means the part of a fetch block after the branch isn't useful.

Superscalar CPUs fetch code in blocks, e.g. aligned 16 byte blocks, containing multiple instructions. In non-branching code, including not-taken conditional branches, all of those bytes are useful instruction bytes.
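From C, one way to nudge the compiler toward a straight-line fast path is the GCC/Clang builtin __builtin_expect. The sketch below is only an illustration: the UNLIKELY macro and the function are made up for this example, and whether the compiler actually moves the cold block out of line depends on the compiler and its options.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)   /* no hint available: plain condition */
#endif

/* Sums the values, treating a negative value as a rare error.
   Returns -1 on the first negative value, otherwise the total.
   The hint says the error check is normally a not-taken branch. */
long sum_nonnegative(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; ++i) {
        if (UNLIKELY(values[i] < 0))
            return -1;        /* cold path */
        total += values[i];   /* hot path, ideally fall-through */
    }
    return total;
}
```

The hint does not change what the function computes. It only tells the compiler which side of the branch to favour when laying out the code.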


Branchless code: using cmov or other tricks to avoid branches

This is the asm equivalent of replacing if (c) a=b; with a = c ? b : a;. If b doesn't have side-effects, and a isn't a potentially-shared memory location, compilers can do "if-conversion" to do the conditional with a data dependency on c instead of a control dependency.
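In C, the three forms below compute the same result. The function names are made up for this example, and which of them a compiler turns into cmov (or into a branch) is entirely its decision, so treat this as a sketch of the source-level idea rather than a guarantee about the generated asm.

```c
/* Branchy form: a control dependency on c. */
int select_branchy(int c, int a, int b)
{
    if (c)
        a = b;
    return a;
}

/* Select form: a natural candidate for if-conversion, since both
   operands are plain values with no side effects. */
int select_ternary(int c, int a, int b)
{
    return c ? b : a;
}

/* Explicit mask form: a data dependency on c by construction.
   mask is all-ones when c is non-zero, otherwise zero.
   The cast back to int assumes the usual two's complement behaviour,
   which GCC and Clang document. */
int select_masked(int c, int a, int b)
{
    unsigned mask = 0u - (unsigned)(c != 0);
    return (int)(((unsigned)b & mask) | ((unsigned)a & ~mask));
}
```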

(C compilers can't introduce a non-atomic read/write: that could step on another thread's modification of the variable. Writing your code as always rewriting a value tells compilers that it's safe, which sometimes enables auto-vectorization: AVX-512 and Branching)

Potential downside to cmov in scalar code: the data dependency can become part of a loop-carried dependency chain and become a bottleneck, while branch prediction + speculative execution hide the latency of control dependencies. The branchless data dependency isn't predicted or speculated, which makes it good for unpredictable cases, but potentially bad otherwise.
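A branchless binary search is a handy example of this trade-off. In the sketch below (the function name is made up, and the compiler may or may not emit cmov for the select), each iteration's load address depends on the previous iteration's select, so the loads form one serial dependency chain. A branchy version can instead speculate ahead down the predicted side.

```c
#include <stddef.h>

/* Returns the index of the first element that is not less than key,
   or count if there is none. sorted must be in ascending order. */
size_t lower_bound_branchless(const int *sorted, size_t count, int key)
{
    const int *base = sorted;
    size_t n = count;

    while (n > 1) {
        size_t half = n / 2;
        /* Select instead of branch: base for the next iteration
           depends on this comparison's result (loop-carried). */
        base = (base[half - 1] < key) ? base + half : base;
        n -= half;
    }
    /* n is now 0 (empty input) or 1 (one candidate left to check). */
    return (size_t)(base - sorted) + (n == 1 && base[0] < key);
}
```

With unpredictable keys this avoids mispredictions. With the same key searched repeatedly, a well-predicted branchy search may win because speculation hides the load latency.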

363 questions
2 votes, 1 answer

Use of __builtin_expected for bounds check

I have this function which, given a Gray code, returns the next Gray code. You can find a more complete explanation about how it works here. The thing is that I wanted to make this increment function modular so that incrementing the Gray code…
Morwenn
2 votes, 0 answers

Why does the Apple A7 (ARMv8a) have 2 branch units (in addition to the indirect branch unit)

The Apple A7 microarchitecture has 2 branch units and an indirect branch unit. Since the A7 is a modern superscalar out of order cpu with a reasonably deep pipeline (read that as a significant penalty for speculation failure), it makes sense that it…
Olsonist
2 votes, 2 answers

Get conditional branch slot from MIPS cross compiler

How can I get conditional branch slot, in which an instruction from before or after the branch is moved to fill in the slot, using mipsel-openwrt-linux-gcc cross compiler? I just use the command to get the MIPS code: ./mipsel-openwrt-linux-gcc -O2…
2 votes, 3 answers

gpgpu: Why dont we need branch prediction in fine grain multi-threading?

When a wavefront executes it provides Fine grained multithreading. One of the consequences of this is having no branch predictions requirement as given in the following slide: But I am unable to understand this. Can someone explain this in a…
user25108
2 votes, 3 answers

branch prediction

Consider the following sequence of actual outcomes for a single static branch. T means the branch is taken. N means the branch is not taken. For this question, assume that this is the only branch in the program. T T T N T N T T T N T N T T T N T…
aherlambang
2 votes, 1 answer

How to make a per-frame branch optimization-friendly?

Suppose I have a main loop that updates different things per frame: int currentFrame = frame % n; if ( currentFrame == 0 ) { someVar = frame; } else if ( currentFrame == 1 ) { someOtherVar = x; } ... else if ( currentFrame == n - 1 ) { …
Luchian Grigore
2 votes, 1 answer

Input for branch predictor unit?

I am looking at slide 13 here: http://research.engineering.wustl.edu/~songtian/pdf/intel-haswell.pdf (It should show a large block diagram for Haswell) At the top it has a block called "Branch Predictors", with two arrows coming out. I am a little…
user997112
2 votes, 1 answer

different branch prediction results in different processors

I would like to ask some things on branch prediction. I am completely aware of what it is and how do they work or their different types. My question is this: How does the processor that i will use each predictor's performance? I mean if I use the…
ghostrider
2 votes, 1 answer

How does the branch predictor know it has made a wrong guess?

My question comes out of Mystical's answer. As I have understood, you have a branch instruction, it can either go to another instruction, say like, 0x123344 or it can continue executing. If a branch predictor makes guess from either of them from…
Shubham
2 votes, 2 answers

Branch prediction - questions about target prediction and using the PC

So I understand the basic techniques that are used in branch prediction for pipelined processors - stuff like 2-bit saturated counters, two level adaptive predictors, etc. Here are my questions: 1) Branch target prediction: why is this important…
JDS
1 vote, 0 answers

Does branchless programming make sense on very old x86 CPUs? (before 80486)

Modern CPUs since at least the 486 ¹) have a tightly-pipelined design, so conditional branches can cause "stalls" in which the pipeline has to be flushed and the code restarted on a different branch of the program. That's why it makes sense to avoid…
1 vote, 0 answers

Branch Prediction: What is the BTB eviction scheme used in modern CPUs (Intel skylake for example)?

For branch prediction, the BHT(Branch history table) is indexed by branch virtual address. Aliasing problem happens when two or more branches hash to the same entry in the BHT(Branch history table), and this confliction results in bad prediction…
Changbin Du
1 vote, 0 answers

Why is there a connection between branch prediction failure and "rep ret" in the K8 processor?

I am currently looking for answers to why gcc generates strange instructions like "rep ret" in the generated assembly code. I came across a question on Stack Overflow where someone raised a similar question: text. In the answers provided, someone…
1 vote, 0 answers

Influencing branchiness when branch behaviour is known

Before I begin, yes, I'm aware of the compiler built-ins __builtin_expect and __builtin_unpredictable (Clang). They do solve the issue to some extent, but my question is about something neither completely solves. As a very simple example, suppose we…
1 vote, 0 answers

Optimizing std::clamp with favor for in-range input: is there a point for keeping cmov instead of a branch?

C++17 std::clamp is a template function that makes sure the input value is not less than the given minimum and less than the given maximum, and returns the input value; otherwise it returns the minimum or the maximum respectively. The goal is to…