I have a simple template header containing 3 function templates (no separate declarations, just definitions, all marked `static inline`). Two of these functions are 5000 lines long. The long functions are very simple; they are long only because they are in straight-line program form, with no loops. In my main program file, where I use an instantiation of the template, if I include the template file directly, the program runs about 10x slower than if I build a separate C++ file that includes the template and instantiates it, and link to it as a static library (`-fPIC` used). Why?

Is the compiler too slow, the instruction cache is getting messed up, the compiler suddenly inlined the long functions when it shouldn’t, or something else?

The code is highly optimized, compiled with the flags `-O3 -ffast-math -march=native -std=gnu++11` using GCC 5.5.0 on Mac OS 10.14.3.

rfabbri
  • By "templated static functions" do you mean you're using the `static` keyword at global scope? This problem is likely extremely compiler/arch/program-specific. Do you have a link for a buildable example? – Cruz Jean Mar 15 '19 at 00:38
  • @CruzJean yes `static` keyword, global scope. I'll make an example, but given the downvotes perhaps this is pointless. – rfabbri Mar 15 '19 at 00:43
  • I really have no idea why this is getting downvoted other than maybe because it doesn't have a MWE. – Andrey Mishchenko Mar 15 '19 at 01:51
  • When you say "program runs about 10x slower", do you actually mean to say the *compilation time* is 10x slower? – AndyG Mar 15 '19 at 02:35
  • @AndyG I mean runtime. – rfabbri Mar 15 '19 at 02:36
  • @rfabbri: Hmm, in this case I'm not sure we'll be much help without a [mcve]. In the general case I would assume the resulting assembly to be identical, but perhaps it is not. Or perhaps something else is happening. – AndyG Mar 15 '19 at 02:38
  • @AndyG I would have to generate an example of a 5000 line function giving a similar behavior. The one I have will be open sourced soon but is not there yet. – rfabbri Mar 15 '19 at 02:45
  • Unrelated: *two of these functions being 5000 lines long* I pray to every deity that's listening that you will never have to troubleshoot these functions – user4581301 Mar 15 '19 at 18:12
  • @user4581301 they implement automatically generated expressions from symbolic software – rfabbri Mar 15 '19 at 19:14

2 Answers


If you declare the function template to be static, doesn't that cause one copy of it to be generated per translation unit (compiled object file)? It could be that this results in 3 copies of the method being generated and yeah, caching issues.

Does getting rid of the static keyword resolve the performance problems?

Andrey Mishchenko
  • How would executing one of three different functions cause more caching issues? – aschepler Mar 15 '19 at 01:43
  • If you have `f1`, `f2`, and `f3` and each one fits in a cache line (for some level of caching) but all three at once do not, and they do the same thing as `f`, then it's faster to call `f` three times than to call `f1(); f2(); f3();`. – Andrey Mishchenko Mar 15 '19 at 01:50
  • There is only one translation unit in the slow case, say `main.cxx`. It directly includes the template file that has the function definitions (no declarations). Despite the long function bodies, the template is simple enough, and likely to be used in few enough translation units at once, that it is worth offering as a “header-only” library, so I need `static`. I am also using the `inline` keyword. Perhaps the compiler decided to really inline the huge functions, giving a perf hit? I will try removing the `static` or `inline` keywords. – rfabbri Mar 15 '19 at 01:51
  • @aschepler if you declare a free function to be `static` in C/C++, its definition is restricted in visibility to the translation unit (compiled object file) of the source file where the definition lives (including if it is put there through `#include`). If you have a `static void f() { /* ... */ }` in `my_header.h` and include the header in 5 `.cc` files, then there will be 5 identical copies of the definition in the resulting `a.out` executable. – Andrey Mishchenko Mar 15 '19 at 01:55
  • Inlining can indeed have unpredictable performance detriments. So maybe that's the problem. – Andrey Mishchenko Mar 15 '19 at 01:57
  • @rfabbri Rule of thumb: never use `inline`. The compiler is almost always smarter than us. – Cruz Jean Mar 15 '19 at 04:57
  • So I did some tests: @AndreyMishchenko, it is not the `inline`, and not the `static`. I put the instantiation into a different translation unit, compiled it into a separate `.o` and linked that directly, without a static library: still slow. It is only fast when the instantiation is in a separate static library. – rfabbri Mar 15 '19 at 14:37
  • I don't get it, if it's a function template, you don't generally control what translation unit it goes into since the definition is in the header. Can you clarify? – Andrey Mishchenko Mar 15 '19 at 15:30

The optimization flags were being left out when compiling the main program, perhaps a CMake bug. When compiling the template instantiation separately as a library, the optimization flags were being used, causing the program to be fast. I forced the optimization flags to be used in the main program with direct template inclusion and it now runs just as fast.
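A quick way to catch this kind of mismatch is to ask the compiler itself whether a given translation unit was built with optimization. Below is a minimal sketch, assuming GCC (or a compatible compiler such as Clang), which predefines `__OPTIMIZE__` for any optimizing build and `__FAST_MATH__` when `-ffast-math` is active. The helper names are hypothetical, not something from my project.

```cpp
// Hypothetical helpers: report which optimization-related flags reached
// this translation unit. Relies on macros predefined by GCC/Clang.

// True when the file is compiled with -O1 or higher.
constexpr bool built_with_optimization() {
#ifdef __OPTIMIZE__
    return true;
#else
    return false;
#endif
}

// True when the file is compiled with -ffast-math.
constexpr bool built_with_fast_math() {
#ifdef __FAST_MATH__
    return true;
#else
    return false;
#endif
}
```

Putting something like `static_assert(built_with_optimization(), "optimization flags missing");` in the main program file would have made the build fail instead of silently producing a slow binary. Looking at the actual compiler command line (for example with a verbose build) is another way to confirm which flags are passed.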

For the sake of curiosity: the `inline` and `static` keywords were harmless; removing them didn't alter the speed. In fact, the compiler is not inlining the functions despite my hint, as it knows when it shouldn't. Forcing inlining with `__attribute__((always_inline))` makes compilation very slow, and runtime also gets about 2x slower.
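For anyone who wants to rule inlining in or out explicitly rather than rely on the compiler's heuristics, GCC also has the opposite attribute, `noinline`. The sketch below uses a tiny hypothetical stand-in (`eval_generated`) for one of the generated straight-line functions; the real ones are thousands of lines long and are not shown here.

```cpp
// Hypothetical stand-in for a generated straight-line function template.
// __attribute__((noinline)) is a GCC/Clang attribute that keeps the
// function out of line even at -O3; swapping in always_inline forces the
// opposite, which is what slowed things down in my case.
template <typename T>
__attribute__((noinline)) static T eval_generated(T x) {
    T t0 = x * x;      // straight-line code: no loops, no branches
    T t1 = t0 + x;
    return t1 * T(2);
}
```

Calling `eval_generated(3.0)` evaluates `(3*3 + 3) * 2`. Comparing timings between the default, `noinline`, and `always_inline` variants is a simple way to see whether inlining matters for a given function.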

rfabbri