4

When I'm programming on a normal day, I make sure that all branches are most likely not taken.

int retval = do_something();
if(!retval) { /* Less-than-likely event*/ }

This optimizes branch prediction, causing the CPU's predictor bit(s) to be set to "do not take". However, do the predictor bit(s) get forced back into "take" after a for loop?

// prediction = "likely take"
if(false) { }

// prediction = "probably take"
if(false) { }

// prediction = "probably not take"
if(false) { }

// prediction = "likely not take"
if(false) { }

/* ... thousands of other if(false) that are speedy-fast */

for(int i = 0; i < 5; i++) { }
// prediction = "likely take"?

I know it's an unrealistic and minuscule optimization, but hey, the more you know.

EDIT: Let's assume GCC does not throw away all of the code above, and let's only talk about the amd64 architecture. I did not realize how low-level this question is.

Dellowar
  • I'm not sure if it will even end up being an optimisation, due to compiler and CPU optimisation. – AntonH Jan 26 '17 at 16:04
  • Don't waste your time. The more you learn, the less you know :-) –  Jan 26 '17 at 16:04
  • @AntonH Obviously, let's pretend that no compiler optimizations are done. For instance, if GCC sees this it will pretty much toss out all of the code... – Dellowar Jan 26 '17 at 16:07
  • The branch predictor and the programmer need not necessarily agree on which branch is most likely. – Lundin Jan 26 '17 at 16:09
  • Are you sure that the CPU's branch prediction flag is mapped to your `if` statement in a way that you could control it from C? I would expect that the branch is to jump after the body of your `if` part. And "don't take the branch" might mean "execute the if part" in the end. – Gerhardh Jan 26 '17 at 16:09
  • Is your question "does a for loop use branches"? If so, the answer is yes. – Kevin Jan 26 '17 at 16:09
  • Also branch predictors usually have many sets of bits. Taking or not taking a branch won't affect the prediction for all other branches. A simple predictor can look at a set of bits based on the location of the branch in the code (e.g. program counter mod # of entries in the predictor). – Kevin Jan 26 '17 at 16:14
  • This is a really architecture-dependent question – Govind Parmar Jan 26 '17 at 16:19
  • @GovindParmar You're right. I'm going to narrow it down to amd64. Based on these other arguments, this question is far more in-depth than I expected. – Dellowar Jan 26 '17 at 16:23
  • Branch prediction is a virtue by itself. Every CPU family (not just the architecture) has its own methods, techniques, etc. So without a specific CPU type your question cannot be answered. Even for a specific type, you should not care about this and leave optimisation to the compiler. Otherwise you are very likely to mess things up and get worse performing code than without caring. – too honest for this site Jan 26 '17 at 16:23
  • A good way to see it is to check the assembly file produced by your compiler and analyse its structure, with and without compiler optimisation. – m.raynal Jan 26 '17 at 16:25
  • @SanchkeDellowar: amd64 is already way too broad. AMD bulldozer has different techniques than Intel Sandy Bridge, than Kabby Bridge or Bridge over troubled Water. – too honest for this site Jan 26 '17 at 16:25
  • @m.raynal: The machine code will not tell how branch prediction works and how well the code performs in a specific scenario - unless you hand-run it, considering all pipeline stages, stalls, caches, the platform with PCIe bridges, RAM controller, etc. That is impossible for any useful program code on Smartphones, PCs and larger. – too honest for this site Jan 26 '17 at 16:27
  • @Olaf Darn. I wouldn't have guessed this was a silicone question. I'll post what I've learned and close the question. – Dellowar Jan 26 '17 at 16:36
  • "When I'm programming on a normal day..." Do you have some special reason for hand-optimizing your code, or have you forgotten rule number 1 about hand-optimizing code: **Don't do it.** – Thomas Padron-McCarthy Jan 26 '17 at 16:42
  • @SanchkeDellowar: I assume it is more a silicon than a silicon**e** question. The latter are better suited at Playboy or Penthouse ;-) – too honest for this site Jan 26 '17 at 16:43
  • "let's also only talk about amd64 " --> then tag the post with `amd` or something narrower. – chux - Reinstate Monica Jan 26 '17 at 18:18

2 Answers

3

As it turns out, branch prediction depends on the model of the CPU.

According to this paper, CPUs relate loops to normal branches in many different ways. Some CPUs have a separate predictor for loops, which means if statements do not affect the prediction of a for statement at all. On others, loops and ordinary branches share the same predictor.

Regardless, there is not one true answer to this question. For loops are not to be measured when talking about branch efficiency.

...Unless of course you plan to run your program on only a single model of CPU.
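To make the "shared versus separate" point concrete, here is a toy sketch of a predictor table with 2-bit saturating counters, indexed by the branch address modulo the table size (the simple scheme Kevin describes in the comments). It is only a model: `toy_predictor`, `TABLE_SIZE`, and the helper names are made up for illustration, and real amd64 predictors are far more elaborate and largely undocumented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: TABLE_SIZE and all names here are hypothetical. */
#define TABLE_SIZE 16u

typedef struct {
    /* One 2-bit counter per entry: 0..1 predict "not taken", 2..3 predict "taken". */
    uint8_t counter[TABLE_SIZE];
} toy_predictor;

/* Start every entry at 2 ("probably take"). */
static void toy_init(toy_predictor *p) {
    for (size_t i = 0; i < TABLE_SIZE; i++)
        p->counter[i] = 2;
}

/* Pick the entry from the branch's address, not from any global state. */
static size_t toy_index(uintptr_t branch_addr) {
    return (size_t)(branch_addr % TABLE_SIZE);
}

static bool toy_predict(const toy_predictor *p, uintptr_t branch_addr) {
    return p->counter[toy_index(branch_addr)] >= 2;
}

/* Move the counter one step toward the actual outcome, saturating at 0 and 3. */
static void toy_update(toy_predictor *p, uintptr_t branch_addr, bool taken) {
    uint8_t *c = &p->counter[toy_index(branch_addr)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

In this model, training the entry for one `if` a thousand times with "not taken" leaves the entry for a loop branch at a different index untouched. Only branches whose addresses collide on the same index (aliasing) influence each other.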

Dellowar
1

Most architectures with branch prediction (including AMD64) consider short downward/forward jumps/branches unlikely and short upward/backward branches/jumps likely. This means that most loops are predicted to continue looping. This makes a do-while loop fractionally more efficient than a for loop or while loop, because those test their condition once before the first iteration; however, most optimizing compilers will optimize these cases to similar code where possible.
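As a rough illustration of the "backward taken, forward not taken" rule, here is a toy sketch. The function names and the addresses are hypothetical, and it assumes the loop is compiled with a single conditional branch at the bottom that jumps back to the top.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static heuristic: a branch whose target lies at a lower address is a
 * backward jump, so guess that it is a loop and predict "taken". */
static bool static_predict_taken(uintptr_t branch_addr, uintptr_t target_addr) {
    return target_addr < branch_addr;
}

/* Count mispredictions for a loop whose bottom-of-loop branch executes
 * `iterations` times: taken on every pass except the last one. */
static unsigned loop_mispredictions(unsigned iterations) {
    const uintptr_t loop_top = 0x400000;    /* hypothetical address */
    const uintptr_t loop_branch = 0x400020; /* hypothetical address */
    unsigned misses = 0;
    for (unsigned i = 1; i <= iterations; i++) {
        bool taken = (i < iterations);
        if (static_predict_taken(loop_branch, loop_top) != taken)
            misses++;
    }
    return misses;
}
```

Under this rule a five-iteration loop mispredicts only once, on the final pass where the loop exits.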

You can see the differences in assembly with gcc at -O3 optimization level by using a conditional with __builtin_expect(). The unlikely branch will typically be a forward jump while the likely condition(s) will either not branch at all or jump backward. This may involve inverting the logic. Note: at -O3, gcc will often duplicate code into the unlikely branch so that the branches in the likely cases can be minimized.
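A minimal sketch of such a conditional is below. The `likely`/`unlikely` macro names are a common convention rather than something the compiler provides, and `do_something()` and `process()` are hypothetical stand-ins for the call in the question.

```c
#include <stdio.h>

/* Conventional wrappers around the GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in: returns 0 only on the rare failure case. */
static int do_something(int input) {
    return input != 0;
}

static int process(int input) {
    int retval = do_something(input);
    if (unlikely(!retval)) {
        /* Rare path: the compiler is free to move this out of the hot path. */
        fprintf(stderr, "do_something failed\n");
        return -1;
    }
    /* Common path: ideally reached without a taken branch. */
    return retval;
}
```

Compiling this with `gcc -O3 -S` and again with the `unlikely()` swapped for `likely()` should let you compare how the two paths are laid out, though the exact output depends on the compiler version and target.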

Predicting backward branches as taken makes sense because a loop that fits in a cache line will not have a cache miss if it branches to its beginning. Similarly, since programs generally progress linearly forward within a function, it is also likely that recently executed code will already be in cache. When you replace a loop with a bunch of extra "optimized" conditionals, at some point (probably around 4 conditionals) the cache misses will outweigh any minuscule benefits you may get, at the cost of readability and maintainability.

technosaurus
  • "This means that most loops are predicted to continue looping" If you could cite that that'd be great! – Dellowar Jan 26 '17 at 19:36
  • @SanchkeDellowar Most of this is covered by https://en.wikipedia.org/wiki/Branch_predictor What I described is static prediction, the simplest form. The topic is really too broad to cover in detail since each architecture is different and even different generations can have drastically different algorithms. – technosaurus Jan 27 '17 at 04:29