
Consider the following situation:

  • You have a macro that is invoked very often throughout the whole code base (for example, some exception handling).

  • Usually the macro does very little, but every so often circumstances arise in which it must do a lot more...

  • This could easily be implemented with a conditional branch that is taken only when the complex code is needed... BUT this may lead to the following grave performance issue:

    • Many modern branch predictors share prediction structures between multiple branches, so the history collected for one branch affects the prediction of every other branch! Thus, the overwhelming number of branches that are almost never taken may "confuse" the branch predictor, so that it makes horrible predictions for the other branches!

How can I get around this issue?

Note that, because the complex code is called very, very rarely, I really don't care about efficiency in that case!

(A starting point for research may be: how do languages like Java get around this issue?)

KGM
  • Branches that are almost never taken are generally recognised fairly well by the branch predictor. I think you are worrying way too much about this scenario. – fuz Jul 09 '20 at 17:35
  • I don't know... my fear is based on the way the agree predictor works in https://www.agner.org/optimize/microarchitecture.pdf . I think it will become completely useless if you flood it with branches that are never taken... I fear that many predictors work alike... – KGM Jul 09 '20 at 17:38
  • That is to say, it would recognize these branches perfectly, but completely mess up on the rest. – KGM Jul 09 '20 at 17:46
  • That's generally not what happens. The branch predictor has enough entries that it's very hard to confuse it with normal code. Also, unless the relevant code is hot, it's unlikely that the occasional misprediction really makes a difference. – fuz Jul 09 '20 at 17:50
  • So you say hash collisions happen rarely, and thus the data of the other branches has no big impact on the prediction of any single branch? What do you mean by "the relevant code"? If you can explain why this is no big deal, I'd accept this as an answer... Heck, it would actually be a great answer! – KGM Jul 09 '20 at 17:51
  • Yes. Hash collisions are rare because usually only a small section of code is really critical to performance, and the branch predictor has plenty of space in its tables for the branches there. By the “relevant code” I mean the code in which branches are mispredicted. Lastly, have you actually measured that these conditional branches are a problem? – fuz Jul 09 '20 at 17:57
  • Firstly, no, I have not measured that... I assumed it. I can't measure it because I don't have all modern CPUs at hand. (I want the program to run quickly on any modern device.) Secondly: yes, now I understand your argument! It's pretty applicable to most small programs... However, I want my stuff to be scalable, so you should be able to efficiently run programs with "reasonably large" relevant code sections as well as regular programs, without performance drops due to that issue... However, I don't know how large "reasonably large" is in practice, nor how large would be bad... – KGM Jul 09 '20 at 18:04
  • If you can show that "reasonably large" relevant code sections (for example that of an entire OS or a database) are small enough for that not to be an issue, that would solve the case... IDK... maybe even complex things run on the same 1000 lines of code most of the time... – KGM Jul 09 '20 at 18:07
  • @fuz: Polluting / diluting the global history sounds like a valid concern for a modern IT-TAGE predictor (Haswell, Zen(?)) where the *index* for a prediction is based on the history of the last `n` branches (like 15 to 20?). But I think the OP should try to benchmark some realistic case with a never-taken conditional branch vs. a plain `nop` just as a best-case baseline for fall-through. (Or *just* a `cmp` instruction padded for instruction length, with no `jcc`, to keep front-end alignment identical to cmp/near-jcc.) It might not be all that bad. – Peter Cordes Jul 09 '20 at 19:29
  • I have two problems with benchmarking that, one theoretical and one practical. The theoretical problem is that I don't want to design software for one CPU, but rather try to make something efficient for all x86-64 CPUs. Even if this works well on my CPU, who says it would work well on other CPUs too? The practical problem, which by far outweighs the theoretical one, is that I am only just starting to learn the theoretical basis of code optimisation. I don't know how to write a proper benchmark, you know, one where the time is not influenced by whether the OS temporarily stops the thread or not, etc. – KGM Jul 09 '20 at 23:53
  • Therefore, I would like this to be solved on a purely theoretical basis... – KGM Jul 09 '20 at 23:54
  • @KGM The branch predictors of modern processors all follow similar principles. While there are a few quirks with some branch predictors, whether a certain piece of code performs well on your CPU is usually a good indicator of how well it performs on others. Plus, if you post your benchmark code, others can run it on their CPUs and we can gather results for multiple µarchs to conclusively find out whether it's really a problem. – fuz Jul 10 '20 at 00:07
  • @KGM About “reasonably large” code sections: most code in a modern OS is almost never executed and certainly not critical to performance. It's common wisdom that the performance-critical parts of a program are generally very small, even in large programs. And once a small part is executed again and again, the branch predictor very quickly learns to predict its branches correctly. – fuz Jul 10 '20 at 00:09
  • @fuz OK... I'll make some benchmark code and update the question... Regarding the question of whether the critical parts of programs are small enough: I would love it if it were so, but I need some quantitative evidence... After all, relevant code sections that are too large may, if the CPU does not cope with them well, really hurt efficiency. – KGM Jul 10 '20 at 00:15
  • @KGM It's common wisdom really. You can use a tool like `pprof` to analyse existing software for its hot spots. You can find that hot spots larger than about a kilobyte are essentially nonexistent. And the branch predictor is able to cope with that just fine. – fuz Jul 10 '20 at 00:20
  • @fuz "You can find that hot spots larger than about a kilobyte are essentially nonexistent." -> a 1 KB hot spot ~ maybe 300 lines ~ maybe 150 "undead" branches and 100 normal branches... calculated pessimistically! So yes, if you can prove this, you have proven that my concern is not really worth worrying about for usual software... and I could move on... really, that would be perfect! – KGM Jul 10 '20 at 00:25
  • It would pretty much answer the question! – KGM Jul 10 '20 at 00:32
  • @KGM I cannot make this an answer because I have not done systematic research. Perhaps someone else has? – fuz Jul 10 '20 at 08:21
  • @fuz By now I have tried to find statistics, but got nothing but crap... The trouble starts with telling Google that you want statistics about the size of the "relevant code"; Google does not understand what you mean by "relevant code" and gives you a bunch of partly programming-related crap! I also tried alternative formulations, but still no statistics! So I suspect that answering this question about reasonable relevant-code size is extremely difficult and thus not worth the time! Thus I'll just write some benchmarking code and we'll see whether this is an issue at all... If it isn't, then all's OK. – KGM Jul 10 '20 at 14:28
  • @KGM Try to search for “hot spots”. That's the usual term for these sections. – fuz Jul 10 '20 at 14:36
  • Okay... I'll do that. – KGM Jul 10 '20 at 14:44
  • I tried but didn't find anything useful... The best I found is https://people.engr.ncsu.edu/ermurph3/papers/icsm11.pdf and https://docs.enterprise.codescene.io/versions/2.8.0/guides/technical/hotspots.html#explore-the-hotspot-activity ... The first is useless because it describes frequently changed parts of code as a hotspot; the second is just an advertisement for some hotspot-finding tool... Most of what I found was related to spatial hotspots (medicine, geography, etc.). It's hard to tell Google you're actually referring to code hotspots, because the spatial ones are detected with, well... code... – KGM Jul 10 '20 at 15:20

0 Answers