
I have a C++ project with many targets that include a lot of Boost header files and other line-intensive headers. Most of the targets include the same headers, so this seemed like an ideal case for precompiled headers (PCH). I created a header file containing the most commonly included headers and precompiled it.

This reduced the lines of code of the compilation unit from 350k to 120k (I passed the -save-temps flag to GCC to check that). I verified with the -H option that the PCH is actually used: the .gch file is listed with an exclamation mark in front of it. The precompiled header is 550 MB.
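For reference, here is roughly how the PCH is built and consumed. The file names below are placeholders (the real build uses the cotire-generated prefix header and the full include/flag set from the command further down), so treat this as a sketch rather than the exact invocation:

# Precompile the header with the same flags as the real compile (placeholder names):
c++ -std=c++0x -fPIC -g -x c++-header prefix.hxx -o prefix.hxx.gch
# Compile a unit against it; with -H, GCC prints "! prefix.hxx.gch" when the PCH is actually loaded:
c++ -std=c++0x -fPIC -g -Winvalid-pch -H -include prefix.hxx -c some_unit.cxx

GCC silently falls back to parsing the plain header if the flags used to build the .gch don't match the compile, which is why the -Winvalid-pch warning and the -H check are useful.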

However, the compile time only dropped from 23 seconds to 20 seconds.

Is such a small improvement to be expected from precompiled headers? If not, what am I doing wrong? What contributes most to the speedup from precompiled headers?

Edit: This is the gcc command:

/usr/bin/c++
-fPIC -I/projectDir/build/source -I/projectDir/source -I/usr/include/eigen3 -include /projectDir/build/source/Core/core/cotire/Core_ORIGINAL_CXX_prefix.hxx -Winvalid-pch -g -Wall -Wextra -Wno-long-long -Wno-unused-parameter -std=c++0x -DBOOST_ENABLE_ASSERT_HANDLER -D_REENTRANT -o CMakeFiles/SubProject.dir/cotire/SubProject_ORIGINAL_CXX_unity.cxx.o -c /projectDir/build/source/ArmarXCore/statechart/cotire/SubProject_ORIGINAL_CXX_unity.cxx

This is the output of -ftime-report (with PCH enabled):

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1321 kB ( 0%) ggc
 phase parsing           :   7.29 (32%) usr   1.69 (51%) sys   8.99 (35%) wall 1135793 kB (54%) ggc
 phase lang. deferred    :   2.75 (12%) usr   0.40 (12%) sys   3.15 (12%) wall  317920 kB (15%) ggc
 phase opt and generate  :  12.03 (53%) usr   1.17 (36%) sys  13.22 (51%) wall  622545 kB (30%) ggc
 phase check & debug info:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     440 kB ( 0%) ggc
 phase last asm          :   0.63 ( 3%) usr   0.02 ( 1%) sys   0.64 ( 2%) wall   26440 kB ( 1%) ggc
 phase finalize          :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 |name lookup            :   1.30 ( 6%) usr   0.29 ( 9%) sys   1.42 ( 5%) wall  153617 kB ( 7%) ggc
 |overload resolution    :   3.37 (15%) usr   0.59 (18%) sys   3.30 (13%) wall  360551 kB (17%) ggc
 garbage collection      :   1.80 ( 8%) usr   0.01 ( 0%) sys   1.82 ( 7%) wall       0 kB ( 0%) ggc
 dump files              :   0.11 ( 0%) usr   0.05 ( 2%) sys   0.18 ( 1%) wall       0 kB ( 0%) ggc
 callgraph construction  :   0.44 ( 2%) usr   0.10 ( 3%) sys   0.59 ( 2%) wall   26388 kB ( 1%) ggc
 callgraph optimization  :   0.21 ( 1%) usr   0.11 ( 3%) sys   0.23 ( 1%) wall   16131 kB ( 1%) ggc
 ipa free inline summary :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 cfg construction        :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    2119 kB ( 0%) ggc
 cfg cleanup             :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall     169 kB ( 0%) ggc
 trivially dead code     :   0.05 ( 0%) usr   0.02 ( 1%) sys   0.13 ( 0%) wall       0 kB ( 0%) ggc
 df scan insns           :   0.30 ( 1%) usr   0.02 ( 1%) sys   0.38 ( 1%) wall    1126 kB ( 0%) ggc
 df live regs            :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.10 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.10 ( 0%) usr   0.03 ( 1%) sys   0.12 ( 0%) wall    7774 kB ( 0%) ggc
 register information    :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis          :   0.02 ( 0%) usr   0.02 ( 1%) sys   0.08 ( 0%) wall    2621 kB ( 0%) ggc
 rebuild jump labels     :   0.05 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 preprocessing           :   1.16 ( 5%) usr   0.45 (14%) sys   1.61 ( 6%) wall  209848 kB (10%) ggc
 parser (global)         :   0.43 ( 2%) usr   0.29 ( 9%) sys   0.83 ( 3%) wall  193966 kB ( 9%) ggc
 parser struct body      :   1.03 ( 5%) usr   0.20 ( 6%) sys   1.37 ( 5%) wall  199825 kB ( 9%) ggc
 parser enumerator list  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     574 kB ( 0%) ggc
 parser function body    :   0.53 ( 2%) usr   0.06 ( 2%) sys   0.49 ( 2%) wall   35252 kB ( 2%) ggc
 parser inl. func. body  :   0.13 ( 1%) usr   0.03 ( 1%) sys   0.14 ( 1%) wall   11720 kB ( 1%) ggc
 parser inl. meth. body  :   1.14 ( 5%) usr   0.19 ( 6%) sys   1.45 ( 6%) wall  115776 kB ( 6%) ggc
 template instantiation  :   4.11 (18%) usr   0.82 (25%) sys   4.78 (18%) wall  566245 kB (27%) ggc
 inline parameters       :   0.05 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall   12792 kB ( 1%) ggc
 tree gimplify           :   0.28 ( 1%) usr   0.03 ( 1%) sys   0.27 ( 1%) wall   55239 kB ( 3%) ggc
 tree eh                 :   0.19 ( 1%) usr   0.00 ( 0%) sys   0.14 ( 1%) wall   20091 kB ( 1%) ggc
 tree CFG construction   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall   34452 kB ( 2%) ggc
 tree CFG cleanup        :   0.09 ( 0%) usr   0.02 ( 1%) sys   0.15 ( 1%) wall      27 kB ( 0%) ggc
 tree PHI insertion      :   0.01 ( 0%) usr   0.01 ( 0%) sys   0.01 ( 0%) wall    5960 kB ( 0%) ggc
 tree SSA rewrite        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    8035 kB ( 0%) ggc
 tree SSA other          :   0.04 ( 0%) usr   0.03 ( 1%) sys   0.12 ( 0%) wall    1604 kB ( 0%) ggc
 tree operand scan       :   0.06 ( 0%) usr   0.04 ( 1%) sys   0.08 ( 0%) wall   16681 kB ( 1%) ggc
 dominance frontiers     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.14 ( 1%) usr   0.04 ( 1%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 out of ssa              :   0.04 ( 0%) usr   0.03 ( 1%) sys   0.14 ( 1%) wall       8 kB ( 0%) ggc
 expand vars             :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 1%) wall   10387 kB ( 0%) ggc
 expand                  :   0.79 ( 3%) usr   0.05 ( 2%) sys   0.77 ( 3%) wall   89756 kB ( 4%) ggc
 post expand cleanups    :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall   14796 kB ( 1%) ggc
 varconst                :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall     532 kB ( 0%) ggc
 jump                    :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 integrated RA           :   4.92 (22%) usr   0.12 ( 4%) sys   4.54 (17%) wall  167029 kB ( 8%) ggc
 LRA non-specific        :   0.38 ( 2%) usr   0.01 ( 0%) sys   0.81 ( 3%) wall     776 kB ( 0%) ggc
 LRA virtuals elimination:   0.07 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall    6530 kB ( 0%) ggc
 LRA reload inheritance  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       4 kB ( 0%) ggc
 LRA create live ranges  :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      40 kB ( 0%) ggc
 LRA hard reg assignment :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 reload                  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue  :   0.16 ( 1%) usr   0.01 ( 0%) sys   0.26 ( 1%) wall   19997 kB ( 1%) ggc
 shorten branches        :   0.17 ( 1%) usr   0.01 ( 0%) sys   0.16 ( 1%) wall       0 kB ( 0%) ggc
 reg stack               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 final                   :   0.63 ( 3%) usr   0.04 ( 1%) sys   0.69 ( 3%) wall   29353 kB ( 1%) ggc
 symout                  :   1.28 ( 6%) usr   0.06 ( 2%) sys   1.23 ( 5%) wall  173563 kB ( 8%) ggc
 uninit var analysis     :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 rest of compilation     :   0.81 ( 4%) usr   0.18 ( 5%) sys   0.93 ( 4%) wall   34415 kB ( 2%) ggc
 unaccounted todo        :   0.25 ( 1%) usr   0.16 ( 5%) sys   0.39 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 :  22.71             3.29            26.03            2104543 kB

thanks veio

  • How much of your current time is compilation and how much is linking? How many translation units do you have to compile (i.e. how many times is g++ invoked)? – John Zwinck Dec 20 '15 at 13:53
  • I have a unity build (see [cotire](https://github.com/sakra/cotire)), so only one translation unit. Linking takes around 1 sec. – veio Dec 20 '15 at 14:21
  • What compile options are you using? I've found, working on my own compiler (and using LLVM, not GCC, admittedly, but I have noticed similar things in GCC) that it's typically code generation that is a big part of the compile time - which means that optimising parsing (PCH is basically the binary form of the parsed header files) and file handling is not going to give much benefit in many cases - particularly if using >= -O1 to optimise code. – Mats Petersson Dec 20 '15 at 14:33
  • These are the compile flags: `CXX_FLAGS: -g -Wall -Wextra -Wno-long-long -Wno-unused-parameter -std=c++0x -DBOOST_ENABLE_ASSERT_HANDLER -D_REENTRANT -fPIC` Do you mean these or something different? – veio Dec 20 '15 at 14:36
  • What if you do a regular, non-unity build? The unity build concept defeats parallelism which could normally be used even on single systems, e.g. `make -j4` on a quad-core machine. At the end of the day, 20 seconds for a large C++ project is not that bad. – John Zwinck Dec 20 '15 at 14:43
  • @JohnZwinck Without the unity build, with PCH and -j6, it also takes 23 seconds. Without PCH it takes 33 seconds. Though, I don't need parallel compilation of this target because I have other targets that will be built in parallel. Also, you can tell cotire to split the unity build into smaller parts for parallel compilation. – veio Dec 20 '15 at 14:54

1 Answer


I haven't seen -ftime-report before. That actually gives some interesting info on the bottleneck.

phase opt and generate  :  12.03 (53%) usr   1.17 (36%) sys  13.22 (51%)

Half the time is spent optimizing, which PCH won't help with. PCH is meant to avoid re-compiling the same headers in every translation unit. A unity build is essentially one large translation unit, so re-compiling headers should not be a bottleneck anyway. Unity builds do, however, tend to take longer to optimize, since compiler optimization time normally isn't linear in the size of the translation unit.

However, since optimizers are generally tuned for non-unity builds, one option might be to use -flto instead. GCC's LTO can be parallelized by passing a job count, e.g. -flto=8. The speedup will most likely be less than linear in the number of jobs, for obvious reasons. FYI, you might also need to switch your linker to ld.gold.
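As a rough sketch (the file names, job count, and output name are just examples, not taken from your build), the LTO variant might look like this:

# Compile each translation unit to LTO bytecode, then link with parallel LTO jobs:
c++ -flto=8 -std=c++0x -fPIC -g -c file1.cxx file2.cxx
c++ -flto=8 -fuse-ld=gold file1.o file2.o -o SubProject

Depending on the GCC and binutils versions, a plugin-enabled linker such as gold (selected here with -fuse-ld=gold) may be needed, particularly for LTO to look inside static libraries.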

Jason
  • I also have many small targets; maybe I should do some profiling there with and without PCH. Regarding -flto: I don't really need parallelization of one target since I have many more targets. – veio Dec 20 '15 at 21:46
  • The small targets are executables and they don't even use the PCH. The PCH is only listed under "Multiple include guards may be useful for:", even though the executable includes headers from the PCH. – veio Dec 20 '15 at 22:00
  • @veio [LTO](https://en.wikipedia.org/wiki/Interprocedural_optimization) is actually a different way of optimizing across translation units, not just parallelization. Unity builds are essentially a hack to get the same result. It looks like the unity build is the bottleneck, so `-flto=n` should give the biggest speedup. 30s is honestly good for a build though. For the smaller stuff, I might try `extern` templates. – Jason Dec 20 '15 at 23:21
  • The 30 seconds are just one lib; the whole thing takes ~50 minutes without optimization on 8 cores. I just tried `-flto=6` and a split-up unity build (approx. 6 translation units per target), but it doesn't seem to have an effect. It even took a few seconds longer. Do I need to do more for that option? – veio Dec 21 '15 at 00:51
  • @veio I normally pass as many cores as there are on the system compiling since it's CPU bound (I haven't tried SMT). It looks like it blocks on system, so you might even benefit from more threads than cores. Your next bottleneck looks like it's parsing though (32%), so you might want to try [distcc](https://github.com/distcc/distcc). You can cache entire compiled objects using [ccache](https://ccache.samba.org/) as well. If not a lot changes in the source, you should see a decent speedup without a penalty on performance for the binaries. You might want to verify that though. – Jason Dec 21 '15 at 01:14
  • @veio Just out of curiosity, is Clang/LLVM an option? – Jason Dec 21 '15 at 01:32
  • Thanks for the tips; we also use ccache and know about distcc, but here I want to optimize on another level. Unfortunately, the project does not compile with Clang. – veio Dec 21 '15 at 22:07
  • @veio You're welcome. If you do distribute and cache the build using `-flto`, it would help others if you could post numbers. If you can get your code to compile under Clang, that will reduce compile times as well (multiplied'ish by the number of compile nodes). – Jason Dec 22 '15 at 03:00
  • Why should it be better with Clang? Is Clang just faster? I'll try to produce some numbers. So far, the unity build has been the most effective change. – veio Dec 22 '15 at 15:35
  • @veio Clang is a much faster compiler, but its optimizer isn't as mature. There are some benefits, but unfortunately there also some [pitfalls](http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-2-firefox.html) as well. Note, the optimizer has improved significantly since that blog post though. GCC LTO has improved significantly as well. – Jason Dec 22 '15 at 16:28