
The BOLT binary optimizer recommends using `perf` to profile binaries for optimization. However, if perf is not available, `llvm-bolt` has an instrumentation mode that can also profile the application:

> If perf record is not available to you, you may collect profile by first instrumenting the binary with BOLT and then running it.

Evidently, this is presented as a "second choice" by the BOLT authors.
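For concreteness, the two collection workflows look roughly like this. This is a sketch based on the BOLT README: `./myapp` and `input.txt` are placeholder names, and the exact optimization-flag spellings vary between BOLT releases.

```sh
# First choice: sample with perf using the LBR (last branch record):
perf record -e cycles:u -j any,u -o perf.data -- ./myapp input.txt
perf2bolt -p perf.data -o perf.fdata ./myapp

# Fallback: instrument the binary with BOLT, then run it; the
# instrumented run writes its profile (to /tmp/prof.fdata by default):
llvm-bolt ./myapp -instrument -o myapp.instrumented
./myapp.instrumented input.txt

# Either profile then feeds the same final optimization step:
llvm-bolt ./myapp -o myapp.bolt -data=perf.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions
```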

What is the downside of this mode in terms of profile quality? Clearly it is slower to collect a profile this way, but is the resulting profile also less accurate or less effective as input to the subsequent BOLT optimization step that produces the final optimized binary?

edited by Peter Cordes; asked by BeeOnRope
  • Of course it's at least a little slower and probably also less accurate, perf being what it is. But whether that slowness is significant, and how likely the loss of accuracy is… I'll vote to close since things like that tend to attract opinionated answers rather than factual ones. – arnt Jan 23 '23 at 08:58
  • @arnt - sorry, which is less accurate? I would expect the instrumentation-based approach to be more accurate (exact, in fact) for many of the things BOLT needs to know compared to any sampling approach. – BeeOnRope Jan 23 '23 at 14:01
  • 3
    _I'll vote to close since things like that tend to attract opinionated answers rather than factual ones_ this is a really niche topic and I would expect a simple factual answer from e.g., one of the BOLT authors or the few people who care. – BeeOnRope Jan 23 '23 at 14:02
  • 1
    @arnt - note that I'm not asking about what's better between two different pieces or software or anything like that but between slightly different modes of operation for the same software. – BeeOnRope Jan 23 '23 at 14:03
  • Instrumentation adds code and often adds load to the memory bus. If you're using it to answer questions like e.g. "how many calls to x does parsing this file cause", that's of no consequence; if you're trying to measure performance (something per second) then the added code/bus traffic often matters. Or *seldom* matters, in some people's opinion, and I've heard some frightfully unproductive discussions of this. There's no rule and hardly even a rule of thumb, because modern CPUs are too complex for rules of thumb. – arnt Jan 25 '23 at 11:00
  • I've seen instrumentation affect the branch prediction cache badly, and also not affect it. I've seen instrumentation turn the main memory bus into a bottleneck, and also not. – arnt Jan 25 '23 at 11:06
  • @arnt - I think you misunderstood my question. Obviously instrumentation dramatically changes the way the binary runs, but I'm not sure that matters here: the instrumented binary is used to calculate statistics like "how many branches were taken", and then a non-instrumented binary is rebuilt using that information. That metric has an exact answer for any deterministic program with the same inputs, instrumented or not. So regardless of how fast the binary runs, you will get the same answer. – BeeOnRope Jan 25 '23 at 14:37
  • 1
    My _impression_ is that `perf-based` profiling is just a shortcut to getting these answers without instrumenting the binary: counting-based perf approaches can give the same type of exact answers, though sampling ones (as are used here) give an approximate one. So my impression is that `llvm-bolt` instrumentation mode will give _slightly more accurate_ profiles, at the cost of instrumentation. Since the final binary has no instrumentation in either case, this would result in a better outcome if you accept the time penalty for the profiling step. – BeeOnRope Jan 25 '23 at 14:39
  • 1
    That's what I want to confirm here. This has a specific answer than an expert in BOLT could probably dash off in a few sentences. – BeeOnRope Jan 25 '23 at 14:39
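For what it's worth, the exactness claim in the comments above is easy to check empirically. A minimal sketch, assuming a single-threaded, deterministic program and reusing the placeholder names from above (`--instrumentation-file` is the llvm-bolt option that sets the profile output path):

```sh
# Bake a fixed profile path into the instrumented binary:
llvm-bolt ./myapp -instrument -o myapp.instrumented \
    --instrumentation-file=/tmp/exact.fdata

# Two runs on the same input should produce identical profiles,
# because the counters are exact rather than sampled:
./myapp.instrumented input.txt && cp /tmp/exact.fdata run1.fdata
./myapp.instrumented input.txt && cp /tmp/exact.fdata run2.fdata
diff run1.fdata run2.fdata    # expect no output

# By contrast, two perf-sampled profiles of the same workload will
# generally differ, since sampled counts are statistical estimates.
```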

0 Answers