
I've created a very simple benchmark to illustrate short string optimization (SSO) and ran it on quick-bench.com. The benchmark works well for comparing an SSO-disabled and an SSO-enabled string class, and the results are consistent with both GCC and Clang. However, I realized that when I disable optimizations, the reported times are around 4 times faster than those observed with optimizations enabled (-O2 or -O3), again with both GCC and Clang.

The benchmark is here: http://quick-bench.com/DX2G2AdxUb7sGPE-zLRa41-MCk0.
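
For illustration, the general shape of such a benchmark is roughly the following (a simplified sketch, not the exact code behind the link; the NoSsoString class here is only a hypothetical stand-in for an SSO-disabled string):

#include <benchmark/benchmark.h>

#include <cstring>
#include <string>

// Hypothetical stand-in for an "SSO-disabled" string: it always heap-allocates,
// even for short contents.
class NoSsoString {
public:
    explicit NoSsoString(const char* s)
        : size_(std::strlen(s)), data_(new char[size_ + 1]) {
        std::memcpy(data_, s, size_ + 1);
    }
    ~NoSsoString() { delete[] data_; }
    NoSsoString(const NoSsoString&) = delete;
    NoSsoString& operator=(const NoSsoString&) = delete;

private:
    std::size_t size_;
    char* data_;
};

// Constructing a short std::string stays within the SSO buffer,
// so no heap allocation is expected.
static void WithSso(benchmark::State& state) {
    for (auto _ : state) {
        std::string s("short");
        benchmark::DoNotOptimize(s);
    }
}
BENCHMARK(WithSso);

// The SSO-disabled stand-in allocates on every construction.
static void WithoutSso(benchmark::State& state) {
    for (auto _ : state) {
        NoSsoString s("short");
        benchmark::DoNotOptimize(s);
    }
}
BENCHMARK(WithoutSso);

BENCHMARK_MAIN();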

Any idea what may cause the unoptimized benchmark to run 4 times faster?

Unfortunately, I can't see the generated assembly, so I don't know where the problem lies (the "Record disassembly" box is checked but has no effect in my runs). Also, when I run the benchmark locally with Google Benchmark, the results are as expected, i.e., the optimized benchmark runs faster.

I also tried to compare both variants in Compiler Explorer, and the unoptimized one seemingly executes many more instructions: https://godbolt.org/z/I4a171.

Daniel Langr
  • Have you read the notes under *More* -> *About Quickbench*? The values shown are not absolute times, but relative to a noop (which will likely also be influenced by optimization flags). It also says that the benchmarks are potentially run in different environments and are not comparable. – walnut Sep 30 '19 at 12:32
  • @uneven_mark You're likely right. Yes, I did read the notes, but some time ago, and I didn't remember the part about the _noop potentially being influenced by optimization flags_. – Daniel Langr Sep 30 '19 at 13:08
  • Well, that part is my own conclusion, not what is actually written. I have not looked at the source code, but if it is implemented in the straightforward way, then I would expect the flags to affect the noop time. – walnut Sep 30 '19 at 13:11
  • @uneven_mark There are several versions of NOOP there: https://github.com/FredTingaud/quick-bench-back-end/blob/master/app.js#L29, just in case you are interested. One would need to explore the generated assembly code to find out what NOOP is translated to, but I can't see it there. I even tried another browser, without success. – Daniel Langr Sep 30 '19 at 13:27
  • 1
    You can see [here](http://quick-bench.com/aiyAwMAoirMtqSSk0gf2M8Cx1ac). So you need to adjust your timings by a factor 6, roughly, which will make it less surprising. – walnut Sep 30 '19 at 13:56

1 Answer


So, as discussed in the comments, the issue is that quick-bench.com does not show absolute times for the benchmarked code, but rather times relative to the time taken by a no-op benchmark. The no-op benchmark can be found in the source files of quick-bench.com:

static void Noop(benchmark::State& state) {
    for (auto _ : state) benchmark::DoNotOptimize(0);
}

All benchmarks of a run are compiled together, so the optimization flags apply to the no-op benchmark as well.

Reproducing the no-op benchmark and comparing it across optimization levels shows a speedup of roughly 6 to 7 times from the -O0 to the -O1 version. When comparing benchmark runs done with different optimization flags, this factor in the baseline must be taken into account. The apparent 4x speedup of the -O0 run observed in the question is therefore more than compensated for, and the behavior is really as one would expect.
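
A self-contained way to reproduce this locally is to put the no-op benchmark into its own file (assuming Google Benchmark is installed; a typical build line is g++ noop.cpp -lbenchmark -lpthread) and build it once with -O0 and once with -O1:

#include <benchmark/benchmark.h>

// The same baseline loop quick-bench.com uses: DoNotOptimize keeps the
// otherwise empty loop body from being removed.
static void Noop(benchmark::State& state) {
    for (auto _ : state) benchmark::DoNotOptimize(0);
}
BENCHMARK(Noop);

BENCHMARK_MAIN();

Since quick-bench reports the ratio T_benchmark / T_noop, and T_noop itself is about 6 to 7 times larger at -O0, the apparent 4x "speedup" of the unoptimized run corresponds (assuming the no-op time is roughly the same at -O1 and -O2/-O3) to an absolute slowdown of roughly 6.5 / 4 ≈ 1.6x for the -O0 code.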

One main difference in the compiled no-op between -O0 and -O1 is that at -O0 there are assertions and other additional branches in the Google Benchmark code that are optimized out at higher optimization levels.

Additionally, at -O0 each loop iteration loads parts of state into registers, modifies them, and stores them back to memory multiple times, e.g. to decrement the loop counter and to evaluate conditionals on it, while the -O1 version keeps state in registers, making memory loads/stores inside the loop unnecessary. The former is much slower, costing at least a few cycles per iteration for the required store-forwarding and/or reloads from memory.

walnut