Questions tagged [branch-prediction]

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

"Why is it faster to process a sorted array than an unsorted array?", Stack Overflow's highest-voted question and answer, is a good introduction to the subject.


Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "not taken", continuing execution with the first branch of code, which follows immediately after the conditional jump, or "taken", jumping to a different place in program memory where the second branch of code is stored.

It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline.

Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.

The time wasted by a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines, so the misprediction delay is typically between 10 and 20 clock cycles. The longer the pipeline, the greater the need for a good branch predictor.
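
As a rough illustration of the effect described above, here is a minimal sketch in the spirit of the sorted-array question. The function names `sum_large` and `sort_ints` are made up for this example. The `if` inside the loop is a data-dependent conditional branch: on random data its outcome is hard to guess, while on sorted data it becomes one long run of not-taken followed by one long run of taken.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum only the elements that are >= 128.
   If the compiler keeps the `if` as a conditional jump, its outcome
   depends on the data, so the predictor has to guess it per element. */
int64_t sum_large(const int *data, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)
            sum += data[i];
    }
    return sum;
}

/* Three-way comparison for qsort, written to avoid overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorting groups all "not taken" outcomes before all "taken" outcomes,
   which is an easy pattern for a branch predictor. */
void sort_ints(int *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_int);
}
```

Timing `sum_large` on a large random array before and after calling `sort_ints` is one way to look for the difference. Whether it shows up depends on the compiler and flags: an optimizing build may turn the `if` into branchless or vectorized code, in which case sorting should make little difference.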

Source: http://en.wikipedia.org/wiki/Branch_predictor


The Spectre security vulnerability revolves around branch prediction:


Other resources

Special-purpose predictors: Return Address Stack for call/ret. ret is effectively an indirect branch, setting program-counter = return address. This would be hard to predict on its own, but calls are normally made with a special instruction so modern CPUs can match call/ret pairs with an internal stack.
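
The call/ret matching described above can be modeled in software as a small fixed-size stack. This is only a toy sketch: the depth `RAS_DEPTH`, the type `ras_t`, and the functions `ras_on_call` / `ras_on_ret` are hypothetical names for illustration, and real hardware differs in size and in how it handles overflow and speculation.

```c
#include <stdint.h>

#define RAS_DEPTH 16u  /* assumed depth for this sketch; real CPUs vary */

typedef struct {
    uint64_t slot[RAS_DEPTH];
    unsigned top;    /* index of the next free slot, wraps around */
    unsigned count;  /* number of valid entries, saturates at RAS_DEPTH */
} ras_t;

/* On a call: remember the address of the instruction after the call.
   When the stack is full, the oldest entry is overwritten. */
void ras_on_call(ras_t *r, uint64_t return_addr)
{
    r->slot[r->top] = return_addr;
    r->top = (r->top + 1u) % RAS_DEPTH;
    if (r->count < RAS_DEPTH)
        r->count++;
}

/* On a ret: pop the most recent entry as the predicted target.
   Returns 1 and stores the prediction, or 0 if the stack is empty. */
int ras_on_ret(ras_t *r, uint64_t *predicted)
{
    if (r->count == 0)
        return 0;
    r->top = (r->top + RAS_DEPTH - 1u) % RAS_DEPTH;
    r->count--;
    *predicted = r->slot[r->top];
    return 1;
}
```

In this model, properly nested call/ret pairs always get the right prediction as long as the nesting depth stays within `RAS_DEPTH`. Deeper recursion overwrites the oldest entries, so the outermost returns no longer have a prediction available.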

Computer architecture details about branch prediction / speculative execution, and its effects on pipelined CPUs

  • Why is it faster to process a sorted array than an unsorted array?
  • Branch prediction - Dan Luu's article on branch prediction, adapted from a talk. With diagrams. Good introduction to why it's needed, and some basic implementations used in early CPUs, building up to more complicated predictors. And at the end, a link to TAGE branch predictors used on modern Intel CPUs. (Too complicated for that article to explain, though!)
  • Slow jmp-instruction - even unconditional direct jumps (like x86's jmp) need to be predicted, to avoid stalls in the very first stage of the pipeline: fetching blocks of machine code from I-cache. After fetching one block, you need to know which block to fetch next, before (or at best in parallel with) decoding the block you just fetched. A large sequence of jmp next_instruction will overwhelm branch prediction and expose the cost of misprediction in this part of the pipeline. (Many high-end modern CPUs have a queue after fetch before decode, to hide bubbles, so some blocks of non-branchy code can allow the queue to refill.)
  • Branch target prediction in conjunction with branch prediction?
  • What branch misprediction does the Branch Target Buffer detect?

Cost of a branch miss


Modern TAGE predictors (in Intel CPUs, for example) can "learn" amazingly long patterns, because they index based on past branch history. The same branch can therefore get different predictions depending on the path leading up to it, and a single branch can have its prediction data scattered over many bits in the branch predictor table. This goes a long way toward solving the problem of indirect branches in an interpreter almost always mispredicting (see X86 prefetching optimizations: "computed goto" threaded code, and Branch prediction and the performance of interpreters - Don't trust folklore). It also means that, for example, a binary search on the same data with the same input can be really efficient.
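
To make "index based on past branch history" concrete, here is a toy simulation. It is not TAGE: it is a much simpler scheme, a table of 2-bit saturating counters indexed by the last few outcomes, compared against a single 2-bit counter. `HIST_BITS` and the function names are assumptions made for this sketch.

```c
#include <stddef.h>

#define HIST_BITS 4u
#define TABLE_SIZE (1u << HIST_BITS)

/* 2-bit saturating counter: values 0 and 1 predict not-taken,
   values 2 and 3 predict taken. */
static unsigned counter_update(unsigned ctr, int taken)
{
    if (taken)
        return ctr < 3u ? ctr + 1u : 3u;
    return ctr > 0u ? ctr - 1u : 0u;
}

/* One counter for the branch, no history. */
size_t mispredicts_single_counter(const int *outcomes, size_t n)
{
    unsigned ctr = 2u;  /* start at weakly taken */
    size_t miss = 0;
    for (size_t i = 0; i < n; i++) {
        int taken = outcomes[i] != 0;
        if ((ctr >= 2u) != taken)
            miss++;
        ctr = counter_update(ctr, taken);
    }
    return miss;
}

/* One counter per recent-history pattern: the same branch gets a
   different counter depending on the outcomes leading up to it. */
size_t mispredicts_history_indexed(const int *outcomes, size_t n)
{
    unsigned table[TABLE_SIZE];
    unsigned hist = 0;
    size_t miss = 0;
    for (unsigned k = 0; k < TABLE_SIZE; k++)
        table[k] = 2u;  /* start every entry at weakly taken */
    for (size_t i = 0; i < n; i++) {
        int taken = outcomes[i] != 0;
        if ((table[hist] >= 2u) != taken)
            miss++;
        table[hist] = counter_update(table[hist], taken);
        /* shift the new outcome into the history register */
        hist = ((hist << 1) | (unsigned)taken) & (TABLE_SIZE - 1u);
    }
    return miss;
}
```

On a strictly alternating taken / not-taken sequence, the single counter in this model mispredicts every not-taken outcome, about half of all branches, while the history-indexed table stops mispredicting after a short warm-up. Real predictors use far longer histories and several tables, which is how they handle much longer patterns.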

Static branch prediction on newer Intel processors - according to experimental evidence, it appears Nehalem and earlier do sometimes use static prediction at some point in the pipeline (backwards branches default to predicted-taken, forward to not-taken). But Sandybridge and newer seem to be always dynamic based on some history, whether it's from this branch or one that aliases it. See also: Why did Intel change the static branch prediction mechanism over these years?

Cases where TAGE does "amazingly" well


Assembly code layout: not so much for branch prediction, but because not-taken branches are easier on the front-end than taken branches. Better I-cache code density if the fast-path is just a straight line, and taken branches mean the part of a fetch block after the branch isn't useful.

Superscalar CPUs fetch code in blocks, e.g. aligned 16 byte blocks, containing multiple instructions. In non-branching code, including not-taken conditional branches, all of those bytes are useful instruction bytes.
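
From C, one way to nudge the compiler toward a straight-line fast path is GCC/Clang's `__builtin_expect`. The sketch below is only an illustration: `UNLIKELY` and `checked_div` are hypothetical names, and the builtin is a hint that affects code layout decisions, not a guarantee about the generated machine code.

```c
/* UNLIKELY is a local convenience macro for this sketch, not a standard one.
   On compilers without __builtin_expect it expands to the plain condition. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Divide with a rarely-taken error check.
   Returns 1 and stores the quotient, or 0 if the denominator is zero.
   The hint tells the compiler the error path is cold, so it can lay out
   the division as the fall-through (not-taken) path. */
int checked_div(unsigned value, unsigned denominator, unsigned *result)
{
    if (UNLIKELY(denominator == 0))
        return 0;
    *result = value / denominator;
    return 1;
}
```

Comparing the compiler's asm output with and without the hint is the way to see whether the layout actually changes for a given compiler and optimization level.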


Branchless code: using cmov or other tricks to avoid branches

This is the asm equivalent of replacing if (c) a=b; with a = c ? b : a;. If b doesn't have side-effects, and a isn't a potentially-shared memory location, compilers can do "if-conversion" to do the conditional with a data dependency on c instead of a control dependency.

(C compilers can't introduce a non-atomic read/write: that could step on another thread's modification of the variable. Writing your code as always rewriting a value tells compilers that it's safe, which sometimes enables auto-vectorization: AVX-512 and Branching)

Potential downside to cmov in scalar code: the data dependency can become part of a loop-carried dependency chain and become a bottleneck, while branch prediction + speculative execution hide the latency of control dependencies. The branchless data dependency isn't predicted or speculated, which makes it good for unpredictable cases, but potentially bad otherwise.
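
The forms discussed above might look like this in C. This is a sketch with made-up function names. Which of these compiles to a conditional jump, a `cmov`, or something else is up to the compiler; for simple scalar cases like these, all three may end up as the same machine code.

```c
#include <stdint.h>

/* Source-level branch: `if (c) a = b;` */
uint32_t select_branchy(int c, uint32_t a, uint32_t b)
{
    if (c)
        a = b;
    return a;
}

/* Unconditional assignment: `a = c ? b : a;`
   Often lowered to cmov on x86, but that is not guaranteed. */
uint32_t select_ternary(int c, uint32_t a, uint32_t b)
{
    a = c ? b : a;
    return a;
}

/* Explicit data dependency using a bit mask instead of control flow. */
uint32_t select_mask(int c, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(c != 0);  /* all ones if c, else 0 */
    return (b & mask) | (a & ~mask);
}
```

The mask version always computes both sides, which is the data dependency on `c` described above: there is nothing to mispredict, but the result cannot be available before `c`, `a`, and `b` all are.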

363 questions
38
votes
1 answer

How has CPU architecture evolution affected virtual function call performance?

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz. It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction…
spraff
  • 32,570
  • 22
  • 121
  • 229
35
votes
2 answers

Branchless internal merge slower than internal merge with branch

I recently asked a question on Code Review to review a sorting algorithm named QuickMergeSort. I won't get in the details, but at some point the algorithm performs an internal mergesort: instead of using additional memory to store the data to merge,…
Morwenn
  • 21,684
  • 12
  • 93
  • 152
35
votes
3 answers

How can I make branchless code?

Related to this answer: https://stackoverflow.com/a/11227902/4714970 In the above answer, it's mentioned how you can avoid branch prediction fails by avoiding branches. The user demonstrates this by replacing: if (data[c] >= 128) { sum +=…
Aequitas
  • 2,205
  • 1
  • 25
  • 51
35
votes
4 answers

Intel x86 0x2E/0x3E Prefix Branch Prediction actually used?

In the latest Intel software dev manual it describes two opcode prefixes: Group 2 > Branch Hints 0x2E: Branch Not Taken 0x3E: Branch Taken These allow for explicit branch prediction of Jump instructions (opcodes like Jxx) I remember reading…
Andrew Tomazos
  • 66,139
  • 40
  • 186
  • 319
34
votes
2 answers

Does GCC generate suboptimal code for static branch prediction?

From my university course, I heard, that by convention it is better to place more probable condition in if rather than in else, which may help the static branch predictor. For instance: if (check_collision(player, enemy)) { // very unlikely to be…
Grzegorz Szpetkowski
  • 36,988
  • 6
  • 90
  • 137
33
votes
5 answers

Why does this C++ function produce so many branch mispredictions?

Let A be an array that contains an odd number of zeros and ones. If n is the size of A, then A is constructed such that the first ceil(n/2) elements are 0 and the remaining elements 1. So if n = 9, A would look like this: 0,0,0,0,0,1,1,1,1 The goal…
jsguy
  • 2,069
  • 1
  • 25
  • 36
32
votes
4 answers

Performance optimisations of x86-64 assembly - Alignment and branch prediction

I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc, using x86-64 assembly with SSE-2 instructions. So far I’ve managed to get excellent results in terms of performance, but I…
Macmade
  • 52,708
  • 13
  • 106
  • 123
28
votes
5 answers

How prevalent is branch prediction on current CPUs?

Due to the huge impact on performance, I never wonder if my current day desktop CPU has branch prediction. Of course it does. But how about the various ARM offerings? Does iPhone or android phones have branch prediction? The older Nintendo DS? How…
porgarmingduod
  • 7,668
  • 10
  • 50
  • 83
28
votes
4 answers

Branch Prediction and Division By Zero

I was writing code that looked like the following... if(denominator == 0){ return false; } int result = value / denominator; ... when I thought about branching behavior in the CPU. https://stackoverflow.com/a/11227902/620863 This answer says…
27
votes
2 answers

Branch target prediction in conjunction with branch prediction?

EDIT: My confusion arises because surely by predicting which branch is taken, you are effectively doing the target prediction too?? This question is intrinsically linked to my first question on the topic: branch prediction vs branch target…
user997112
  • 29,025
  • 43
  • 182
  • 361
27
votes
2 answers

How far does GCC's __builtin_expect go?

While answering another question I got curious about this. I'm well aware that if( __builtin_expect( !!a, 0 ) ) { // not likely } else { // quite likely } will make the "quite likely" branch faster (in general) by doing something along the…
Dave
  • 44,275
  • 12
  • 65
  • 105
25
votes
3 answers

Do branch likelihood hints carry through function calls?

I've come across a few scenarios where I want to say a function's return value is likely inside the body of a function, not the if statement that will call it. For example, say I want to port code from using a LIKELY macro to using the new…
Riley
  • 982
  • 1
  • 7
  • 19
25
votes
4 answers

Is there a code that results in 50% branch prediction miss?

The problem: I'm trying to figure out how to write code (C preferred, ASM only if there is no other solution) that would make the branch prediction miss in 50% of the cases. So it has to be a piece of code that "is immune" to compiler optimizations…
25
votes
5 answers

What is the point of delay slots?

So from my understanding of delay slots, they occur when a branch instruction is called and the next instruction following the branch also gets loaded from memory. What is the point of this? Wouldn't you expect the code after a branch not to run in…
James
  • 706
  • 3
  • 8
  • 16
24
votes
3 answers

Why did Intel change the static branch prediction mechanism over these years?

From here I know Intel implemented several static branch prediction mechanisms these years: 80486 age: Always-not-taken Pentium4 age: Backwards Taken/Forwards Not-Taken Newer CPUs like Ivy Bridge, Haswell have become increasingly intangible, see…