
During a discussion, a developer informed me that

  • the likely/unlikely gcc optimization (the macros are sketched below for reference)
  • placing the most common branch first in the code

have no effect on Intel processors and should be ignored. The stated reason is the dynamic branch prediction Intel employs. I have two questions to which I could not find an explicit answer:

  1. Is branch prediction data global for the processor (core), or is it per process?
  2. If it is per process: is the branch target buffer, with its accumulated results, preserved for the entire lifetime of the process, or is it flushed when the process uses up its timeslice and its instruction cache gets flushed, or when the process is moved to another core?

Assumptions:

  • Linux
  • Skylake Intel processor
  • Several separate processes run on a core.
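
For reference, these hints are usually defined on top of gcc's __builtin_expect; this is, for example, how the Linux kernel defines them:

    /* branch hints built on gcc's __builtin_expect, as in the Linux kernel */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)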
VladimirS
    I cant imagine it being that overly complicated. In theory it can only see as far as potential branches in the pipe plus other data/instructions and take a guess as to whether or not to fetch any of those branches (of the ones it can actually see, ones that dont have to wait for a computation to complete in time). – old_timer Jan 31 '16 at 18:25
  • Somewhat related: https://lwn.net/Articles/420019/ and https://lwn.net/Articles/70473/. See also Agner Fog's microarchitecture docs for how the branch predictors in different processors work: http://www.agner.org/optimize/microarchitecture.pdf – ninjalj Jan 31 '16 at 23:56
  • Whoever told you this apparently mixed up the compiler intrinsic, which can have an effect on the generated code, with the x86 branch-hint instruction prefix, which is ignored by modern processors. – MikeMB Jan 31 '16 at 23:58
  • @ninjalj I read Agner's doc and a few other sources before posting this question. Agner Fog's docs are excellent, but the life span of predictions was still unclear to me. – VladimirS Feb 01 '16 at 14:12
  • @MikeMB As I understand it, likely/unlikely can reorder branches in the compiled code. There are possible issues with that, i.e. the programmer's assumptions may be bad. However, prediction helps the processor pre-load the expected path regardless of branch order in the code. My question is not about likely/unlikely (gcc documents it well); it is about how long prediction data exists and is used. Agner Fog explains prediction in detail relative to a branch; I am trying to figure out what happens relative to a process. – VladimirS Feb 01 '16 at 14:33
  • @user3545806: I know; that's why I didn't make it an answer, just a comment on the first part of your question. Btw: based on likely/unlikely, the compiler could (in theory) do much more than just reordering. It could e.g. affect partial inlining or outlining decisions. – MikeMB Feb 01 '16 at 14:35
  • @MikeMB Yes it does. I agree. – VladimirS Feb 01 '16 at 15:11
  • I fear you won't find more detailed information about how exactly a Skylake branch predictor works than what can be found in the document ninjalj already mentioned. My personal expectation is that, if at all, there might be separate branch prediction per HW thread (on HT-enabled processors), but certainly not per SW process, simply because I could not find any documentation saying that this information is part of what is saved on a context switch, and I don't think it would be worth the cost for the processor to do it automatically in HW (e.g. using a technique similar to the TLB). – MikeMB Feb 01 '16 at 15:43

2 Answers


The likely/unlikely optimisation has nothing whatsoever to do with branch prediction.

When an Intel processor encounters a conditional branch, execution is fastest if the branch is not taken, i.e. if it falls through. In a straightforward if/else statement, the conditional branch is followed by the if-branch. So if the else-branch is executed 99% of the time, this isn't optimal. The compiler can replace if (condition) ifbranch else elsebranch with if (!condition) elsebranch else ifbranch, so that most of the time the branch is not taken (if that's what the likely/unlikely hint indicates).
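
As a hedged sketch (handle_error and do_work are placeholder names, not from the question), with the hint gcc lays out the common path as the straight fall-through:

    #define unlikely(x) __builtin_expect(!!(x), 0)

    extern void handle_error(void);   /* hypothetical rare-path helper   */
    extern void do_work(void);        /* hypothetical common-path helper */

    void step(int err)
    {
        if (unlikely(err))    /* hint: err is almost always 0  */
            handle_error();   /* rare path: placed out of line */
        do_work();            /* common path: branch not taken */
    }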

Or consider a loop that is on average executed less than once (for example, only one time in 100). Normally a compiler would hoist loop-invariant code out of the loop. That's a waste of time if the loop is never executed! You can tell the compiler that the loop is likely not executed, and the loop-invariant code will not be hoisted.
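
A sketch of the loop case, with expensive_setup and work as illustrative placeholders; the hint in the loop condition tells the compiler the body will almost never run, discouraging it from hoisting the invariant call in front of a loop that usually executes zero times:

    extern int  expensive_setup(void);   /* loop-invariant computation */
    extern void work(int i, int k);

    void rare_loop(int n)                /* n is 0 on ~99 of 100 calls */
    {
        for (int i = 0; __builtin_expect(i < n, 0); i++) {
            int k = expensive_setup();   /* candidate for hoisting */
            work(i, k);
        }
    }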

In other words, the developer doesn't know what he is talking about. That said, we are talking about a micro-optimisation that, like all micro-optimisations, is rarely useful; but that doesn't mean it doesn't work.

And branch prediction state is per processor (core). Nothing gets flushed, or stored and restored, on a context switch.

gnasher729
  • It's still branch prediction. In the absence of any other information, a forward branch is predicted not taken and a backwards branch is predicted taken. – Alan Stokes Jan 31 '16 at 19:13
  • I agree with Alan; this does not answer my question. I know how likely/unlikely works. The question is how branch prediction works... – VladimirS Jan 31 '16 at 23:20
  • @Gnasher729 Could you elaborate on your last sentence? Does it mean that branch prediction data is global for the processor (core)? So if a process gets its next time slice, its predictions will still be there? – VladimirS Feb 01 '16 at 15:13

Branch prediction data is global per processor.

This means that multiple processes sharing the same processor will interfere with each other's branch prediction when two different branches share the same prediction table entry. This is called aliasing. To a certain extent, aliasing can also occur within a single process.
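
As a rough sketch of the mechanism (the table size and indexing scheme are illustrative assumptions, not Skylake's documented design): if the predictor table is indexed by the low bits of the branch address, any two branches that agree in those bits collide, whether they belong to the same process or not.

    /* Hypothetical predictor table indexed by the low 12 bits of the
     * branch address; real predictors also hash in branch history. */
    #define TABLE_BITS 12

    unsigned predictor_index(unsigned long branch_addr)
    {
        return (unsigned)(branch_addr & ((1u << TABLE_BITS) - 1));
    }

    /* A branch at 0x401234 in one process and one at 0x7f35d0a01234 in
     * another both map to entry 0x234, so they share predictor state. */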

"Process Switches and Branch Prediction Accuracy" (David Chen et al., 2005) looked into the performance benefit of using a prediction table per process. They found "a prediction improvement of 0.5 – 3%". But their conclusion was that "for general purpose applications, this proposed system provides both a limited benefit and comes with a high hardware cost due to the large number of parallel history tables that must be implemented and quickly accessed."

I doubt this conclusion has changed in the years since. In fact, modern multicore CPUs probably reduce aliasing quite a lot: the scheduler tends to keep a given thread/process on the same core, for performance. So if a system has two high-load processes, each tends to hog its own core, and they hardly ever interfere with each other's branch prediction tables.
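
For what it's worth, that placement can also be made explicit instead of left to the scheduler, using Linux's sched_setaffinity(2); a minimal sketch:

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling process to one core so its branch-prediction
     * state stays warm on that core's predictor across timeslices. */
    int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);   /* 0 = self */
    }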

Daniel
  • Per-process prediction would be a solution to the Spectre problem! Perhaps we'll see it a few years or decades from now, maybe in some new from-scratch hardware designs that set out to be more resistant to microarchitectural side-channel problems like Spectre, instead of current mitigations like letting the kernel do costly flushing of branch prediction. https://en.wikipedia.org/wiki/Spectre_(security_vulnerability) – Peter Cordes Jun 26 '21 at 15:42