
I am trying to compile generated C code which comes from large engineering models. The code is unlike what one would write by hand: there are many unrolled loops, extensive use of macros, and huge manually indexed arrays, and most importantly the source files are massive (>1e6 lines).

When compiling these source files with -O2 or -O3, my compile times become unmanageably high: 10-30 mins per file, with both Clang and GCC. I can't follow the generated assembly code well enough to judge the quality of the optimisation. Compile time can be reduced by not generating debug info or by turning off warnings, but these savings are small compared to turning off optimisations. At runtime there is a noticeable difference between -O0 and -O2, so I cannot justify compiling without optimisation. When compiling with -ftime-trace, I can see that the Clang frontend is responsible for >90% of the time. According to htop, the process is not bottlenecked by memory; it appears to be entirely CPU-bound.
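For reference, the per-pass timing can also be gathered with -ftime-report, which both GCC and Clang accept (the file name below is a stand-in for one of the generated sources, not a real file from the project):

```shell
# demo.c is a trivial placeholder for one of the generated translation units.
printf 'int f(int x){return x*x;}\n' > demo.c

# -ftime-report prints a per-pass timing summary to stderr, which shows
# where the frontend/optimizer time goes during an -O2 build.
gcc -O2 -ftime-report -c demo.c -o demo.o 2> timings.txt
```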

Is there some preprocessing I can do to improve the compile times? Would breaking the source file into smaller chunks improve performance, and why? Are compilers designed to work with such huge source files? Are there any other compile options I should be aware of?

Surprisingly, MSVC on Windows with /O2 takes a fraction of the time that Clang and GCC take.

Example of compiler arguments: clang -m64 -Wno-everything -c -D_GNU_SOURCE -DMATLAB_MEX_FILE -ftime-report -DFMI2_FUNCTION_PREFIX=F2_Simulations_SteadyState_SteadyState2019MPU_ -DRT -I/opt/matlab/r2017b/extern/include -I/opt/matlab/r2017b/simulink/include -I/mnt/vagrant_shared/<path>/Source -I/mnt/vagrant_shared/<path>/export -fexceptions -fPIC -fno-omit-frame-pointer -pthread -O0 -DNDEBUG -std=c99 /mnt/vagrant_shared/<path>/some_file.c -o /mnt/vagrant_shared/<path>/some_obj.obj

Platform: CentOS 7 running on a virtual box VM. Clang 7, GCC 4.8 (I am stuck on these older versions because of other requirements).

Mansoor
  • How do the unoptimized and O1 times compare between the several compilers? – Steve Summit Nov 25 '19 at 23:31
  • I don't know much about optimization, but I know that gcc and clang do some pretty aggressive stuff. I can easily imagine that some of those aggressive optimizations are, say, O(N^2) in the number of statements per function. I know even less about MSVC, but I can imagine that it's not quite so aggressive. – Steve Summit Nov 25 '19 at 23:33
  • Can you quantify size of files - how many functions in each file, how many lines. Can you share one sample file (or reduced file), so that it will be possible to reproduce/explore ? – dash-o Nov 25 '19 at 23:37
  • @SteveSummit `O0` is ~1 mins for all with time difference being within the margin of error. `O1` is between 3-5 mins with Clang being faster than GCC. Obviously, for these quicker compiles, the other options become significant. – Mansoor Nov 25 '19 at 23:53
  • @Mansoor can you at least share build command line options. Is the code C/C++/.... How many classes/functions/methods per file ? – dash-o Nov 26 '19 at 05:44
  • gcc [documents](https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Optimize-Options.html#Optimize-Options) individual optimization options. I'd try not using `-O2` and find which one are affecting the time the most. – AProgrammer Nov 26 '19 at 08:46
  • @AProgrammer Well, there seems to be one named `-fexpensive-optimizations`, so that's a start. Is there a way of selectively turning them off rather than on? Do you just replace `-f` with `-fno-` – Mansoor Nov 26 '19 at 09:19
  • @Mansoor, I'd suggest that you remove your last update and use its content to make an answer. – AProgrammer Nov 26 '19 at 14:13
  • @AProgrammer I thought the same, just seemed weird to answer one's own question. – Mansoor Nov 26 '19 at 14:18
  • @Mansoor, that [isn't a problem](https://stackoverflow.com/help/self-answer). – AProgrammer Nov 26 '19 at 14:20
  • Gcc officially recommends using -O1 for huge generated sources. – Marc Glisse Nov 26 '19 at 23:42
  • @MarcGlisse Can you provide a reference? – Mansoor Nov 27 '19 at 09:11
  • related (and contains further links): https://stackoverflow.com/questions/57428822/compile-large-array-in-dymola – matth Nov 27 '19 at 13:38
  • @Mansoor: Marc Glisse is a GCC developer; you can take his word for it. (And it sounds reasonable to me, although I don't spent a lot of time looking at `-O1` output.) – Peter Cordes Dec 03 '19 at 05:49

1 Answer


Following a suggestion made by @AProgrammer, replacing -O2 with a subset of the optimisations it enables yields substantial compile-time improvements with negligible runtime differences.

Specifically, I excluded:

-fcode-hoisting -fdevirtualize-speculatively -fexpensive-optimizations -fipa-bit-cp -fipa-icf -fipa-ra -fipa-vrp -fisolate-erroneous-paths-dereference -flra-remat -freorder-blocks-algorithm=stc -fstore-merging -fipa-reference -fipa-reference-addressable -fshrink-wrap-separate -fssa-backprop -fssa-phiopt

Some of these are only applicable to C++ anyhow. The resulting compile is ~3x faster. There may be other options included in -O3 which could be added back with little compile-time penalty.


Others have suggested that both GCC and Dymola recommend -O1 as a good trade-off between compile-time and run-time performance. Using some extra -f options on top of -O1 would be a good way to future-proof this against changes in the effects and benefits of different GCC options.

Also, total compilation time (compile and link) is made worse by breaking up the source file into smaller chunks, as expected.

Peter Cordes
Mansoor
  • To be more future-proof against changes in GCC options, you probably want to enable some extra set of optimizations on top of a baseline of `-O1`. You don't want to miss anything really important and cheap in future GCC versions if options get renamed or internals get refactored between optimization passes. Future readers might also want to try using -O2 and selectively *disabling* things with `-fno-whatever`. You can see what options are enabled by looking at asm comments at the top of `-fverbose-asm -S` output. – Peter Cordes Dec 03 '19 at 05:51
  • When you did separate compile and link, did you parallelize compilation of separate `.c` files, like `make -j`? GCC can only use one CPU core per source file. That will defeat inlining of helper functions unless you use LTO (redoing whole-program optimization again at link time which would defeat the purpose), so you'd have to be careful which functions you group together. – Peter Cordes Dec 03 '19 at 05:55
  • Did you try `-march=native` to set tuning and ISA options? That introduces more peepholes to look for (like BMI1/BMI2), but on x86 it will make variable-count shifts more efficient on Intel Haswell and later (BMI2 SHLX / SHRX are single-uop). You're not auto-vectorizing so SIMD instructions probably won't get used; IDK if having AVX available would slow anything down. Hopefully not, and 3-operand no-destructive instructions can help with scalar FP code. So can FMA if GCC looks for that. (clang auto-vectorizes at `-O2`, gcc only enables it as part of `-O3`.) – Peter Cordes Dec 03 '19 at 05:58
  • @PeterCordes Regarding your first comment, I agree. Regarding the second, I do, but for simplicity of the question, I only refer to a single source file. To your third point, I'll look into that, I would have assumed that it would further increase the total compile time. – Mansoor Dec 03 '19 at 10:11
  • Right, `-march=native` is unlikely to speed up compilation, but if it doesn't cost time it should make better code that's worth it (you mentioned that `-O0` wasn't really usable so apparently performance of the generated code matters, so it's a matter of getting the most bang for your buck in terms of big speedups for small extra compile time, or removing one of the more-expensive `-f` opts). Especially on a CPU that supports FMA; there are scalar versions of `vfmadd` instructions. You might also want `-fno-trapping-math -fno-math-errno` so more math library functions can (more fully) inline. – Peter Cordes Dec 03 '19 at 10:17