
I am reading:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

It first suggests:
In combination with -flto using this option (-fwhole-program) should not be used. Instead relying on a linker plugin should provide safer and more precise information.

And then, it suggests:
If the program does not require any symbols to be exported, it is possible to combine -flto and -fwhole-program to allow the interprocedural optimizers to use more aggressive assumptions which may lead to improved optimization opportunities. Use of -fwhole-program is not needed when linker plugin is active (see -fuse-linker-plugin).

Does it mean that in theory, using -fuse-linker-plugin with -flto always gets a better optimized executable than using -fwhole-program with -flto?

I tried to use ld to link with -fuse-linker-plugin and -fwhole-program separately, and the executables' sizes at least are different.

P.S. I am using gcc 4.6.2, and ld 2.21.53.0.1 on CentOS 6.

S.S. Anne
Hei
  • fwiw, following your quote - "Use of -fwhole-program is not needed when linker plugin is active (see -fuse-linker-plugin)." - we later see in the documentation - **"This option [`-fuse-linker-plugin`] is enabled by default when LTO support in GCC is enabled and GCC was configured for use with a linker supporting plugins (GNU ld 2.21 or newer or gold)."** - so i would guess that covers most reasonable modern installations of gcc. meaning they have a default option that makes `-fwhole-program` unnecessary. but this is just my interpretation of it all! – underscore_d Mar 13 '16 at 18:02
  • @underscore_d Great! Now, how do we turn the damned thing off?! (The fuse-linker-plugin, I mean.) Please see: https://stackoverflow.com/questions/68582122/gcc-10-3-1-1-fc32-build-failing-with-gcc-fatal-error-fuse-linker-plugin-b – Richard T Jul 30 '21 at 17:56

1 Answer


UPDATE: See @PeterCordes's comment below. Essentially, passing -fuse-linker-plugin explicitly is no longer necessary.

These differences are subtle. First, understand what -flto actually does: it essentially creates output that can be optimized later (at "link time").

What -fwhole-program does is assume "that the current compilation unit represents the whole program being compiled", whether or not that is actually the case. GCC therefore assumes that it knows every place that calls a particular function. As the documentation says, this lets the inter-procedural optimizers make more aggressive assumptions. I'll explain that in a bit.
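As a rough sketch of that assumption: the GCC documentation describes -fwhole-program as treating public functions as static, except for main and anything marked with the externally_visible attribute. The file name, function names, and compile command below are made up for illustration and are not from the question.

```cpp
// whole_program_sketch.cpp (hypothetical file name)
// Assumed compile-only command, so the resulting symbols can be inspected:
//   g++ -O2 -fwhole-program -c whole_program_sketch.cpp

// Ordinary function with external linkage. Under -fwhole-program GCC is
// allowed to treat it as if it were declared static: it can be inlined into
// every caller in this file and its out-of-line copy discarded.
int scale(int x) { return 3 * x; }

// externally_visible is a GCC attribute that opts a symbol out of that
// assumption, so it should remain callable from other object files.
__attribute__((externally_visible)) int exported_scale(int x) {
    return scale(x) + 1;
}
```

Running nm on the resulting object file is one way to check which of the two symbols is still exported with a given GCC version.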

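Lastly, -fuse-linker-plugin does not perform any optimizations by itself. It lets the linker (gold, or GNU ld 2.21 or newer) load GCC's LTO plugin, which reports which symbols are referenced from outside the LTO objects (by non-LTO object files or through dynamic linking) and allows objects containing LTO bytecode to be extracted from library archives. So, this one is designed to pair with -flto: -flto means save enough information to do optimizations later, and -fuse-linker-plugin gives the link-time optimizer accurate information about which symbols it may treat as internal.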

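So, where do they differ? As the GCC documentation suggests, there is no advantage in principle to using -fwhole-program, because that option assumes something that you then have to ensure is true, whereas the linker plugin works from the linker's actual symbol resolution. To break it, define a function in one .cpp file, use it in another, and compile each file separately with -fwhole-program (without -flto). You will get a linker error, because GCC treated the function as local to the file that defines it.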
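A minimal sketch of that failure case follows. The file names, function name, and commands are assumptions for illustration; the second file is shown as a comment so the block stays a single translation unit.

```cpp
// helper.cpp (hypothetical): defines a function that another file calls.
int helper_value() { return 42; }

// main.cpp (hypothetical), shown here only as a comment:
//   int helper_value();
//   int main() { return helper_value() == 42 ? 0 : 1; }
//
// Assumed commands, each file compiled on its own and without -flto:
//   g++ -O2 -fwhole-program -c helper.cpp
//   g++ -O2 -fwhole-program -c main.cpp
//   g++ helper.o main.o -o demo
// The last step is expected to fail with an undefined reference to
// helper_value(), because helper.cpp was compiled as if it were the
// whole program and nothing in it calls the function.
```

Compiling both files with -flto and passing -fwhole-program only at the link step should avoid the error, since the assumption is then applied to the merged program rather than to each file.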

Is there any advantage to -fwhole-program? If you only have one compilation unit, you can use it, but honestly, it won't be any better. I was able to get differently sized executables from equivalent programs, but when I checked the generated machine code, it was identical. In fact, the only differences I saw were in the line numbers in the debugging information.

Mouna Apperson
  • IIRC, linking with `gcc -O3 -march=native ... -flto` Just Works these days, and does all the necessary linker-plugin stuff. (Pass the same optimization options to the link command as when you compiled.) – Peter Cordes Aug 19 '19 at 01:00
  • @PeterCordes Do you know if `gcc -O2 -march=native ... -flto` also does this? Many times, particularly for programs with the machine code on the hot paths larger than the trace cache, `O2` outperforms `O3` and `O3` is best applied to select functions. And, thanks for the update! – Mouna Apperson Aug 19 '19 at 16:34
  • `-O2` also "just works" of course. But only `-O3` enables auto-vectorization. If any important loops can benefit from auto-vectorization it's usually worth the code-size cost. Modern x86 CPUs don't use a trace cache, that was only Pentium 4. The uop cache in Sandybridge / Ryzen isn't a *trace* cache; it doesn't follow jumps. Plus, legacy-decode bandwidth is vastly better than in P4, so uop cache misses are not a disaster. But yes, `O2` could be better overall for some large programs if L1i misses are a problem. – Peter Cordes Aug 20 '19 at 00:41
  • Note that modern GCC `-O3` does *not* enable `-funroll-loops`; that's only done for hot loops when you do PGO with `-fprofile-generate` / `-fprofile-use`. Also, GCC 8 and later typically auto-vectorize with unaligned load/store instead of fully-unrolled prologue/epilogue to reach an alignment boundary. That was always bad and super-bloated, spending most of the code-size for an auto-vectorized loop on the startup / cleanup, not unrolling the actual important part at all! BTW, clang `-O2` *does* auto-vectorize, and unrolls small loops by default. – Peter Cordes Aug 20 '19 at 00:43