How can a GCC instrumented executable be faster than the non-instrumented?

Question

I'm benchmarking the overhead of GCC Profile-Guided Optimization on the SPEC benchmarks, and I'm getting weird results on some of them: two of my benchmarks run faster when instrumented.

The normal executable is compiled with: -g -O2 -march=native

The instrumented executable is compiled with: -g -O2 -march=native -fprofile-generate -fno-vpt

I'm using GCC 4.7 (The Google branch to be precise). The computer on which the benchmark is running has an Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz.

bwaves is a Fortran benchmark and libquantum is a C benchmark.

Here are the results:

bwaves-normal: 712.14 
bwaves-instrumented: 697.22 
  => ~2% faster 

libquantum-normal: 463.88
libquantum-instrumented: 449.05
  => ~3.2% faster

I ran the benchmarks several times, thinking it could be a problem on my machine, but I got the same results each time.

I would understand a very small overhead on some programs, but I don't see any reason for an improvement.

So my question is: How can the GCC instrumented executable be faster than the optimized normal one?

Thanks

I don't suppose this is reproducible on a small test case that we can build and experiment with? — NPE, Jan 30 '13 at 18:06
Unfortunately no and I can't share the sources/executable of SPEC as it is commercial :( — Baptiste Wicht, Jan 30 '13 at 18:17
It might be that your performance is determined by cache effects. Adding code will change which cache lines get hit; if you have bad interference when you have no instrumentation, you could see this. Any reason to believe your program touches a lot of cache? — Ira Baxter, Jan 30 '13 at 21:35
I don't know exactly, but I know that some of the benchmarks are very memory-intensive. Perhaps that is the case for this one. If so, the instrumentation could indeed change the memory effects. — Baptiste Wicht, Jan 31 '13 at 00:03
Regarding the cache effects suggested by @IraBaxter, do you have prefetch instructions in that code? Adding code should not normally remove cache issues (cache size, memory b/w and latency don't magically change, the extra code might run "for free" though) but might indeed put more pressure on the cache. However, prefetching is a different story, it may actually be slower than not doing it, if not done early enough (or with a wrong access hint). In this case, "some more code" slightly delaying execution could therefore conceivably really make it run faster. — Damon, Jan 31 '13 at 12:02

score 1 · Answer 1 · answered Jan 31 '13 at 12:41

Looking at the GCC documentation, it looks like -fprofile-generate does activate some specific code transformations to make profiling easier/cheaper, so the instrumented code isn't really the original code + instrumentation. The changes could make the code faster, and adding code will also change the caching behaviour. Hard to know without seeing the offending code. And from my (long ago) fooling around with LCC, when profiling is done intelligently it involves surprisingly few code changes.

Just out of curiosity: how does the code compiled with the profile taken into account fare compared to the above?

For bwaves, PGO-compiled is 3.5% faster and for libquantum, it is almost 15% faster. — Baptiste Wicht, Jan 31 '13 at 16:42

score 1 · Accepted Answer · answered Jan 31 '13 at 14:16

I can think of two possibilities, both relating to cache.

The first is that the counter increment "warms" some important cache lines. The second is that adding the structures required by instrumentation causes some heavily used arrays or variables to fall into different cache lines.

Another point is that incrementing a profiling counter doesn't have to happen on every iteration of a for loop -- if there's no 'break' or 'return' in the loop, the compiler is allowed to move the increment out of the loop and add the iteration count once.
