C++ operator overload performance issue

Question

Consider the following setup. We have three files:

main.cpp:

#include <cstdio>
#include <ctime>
#include "class.h"

int main() {
    // Baseline: plain int accumulation.
    clock_t begin = clock();
    int a = 0;
    for (int i = 0; i < 1000000000; ++i) {
        a += i;
    }
    clock_t end = clock();
    printf("Number: %d, Elapsed time: %f\n",
            a, double(end - begin) / CLOCKS_PER_SEC);

    // Same accumulation through the overloaded operator.
    begin = clock();
    C b(0);
    for (int i = 0; i < 1000000000; ++i) {
        b += C(i);
    }
    end = clock();
    printf("Number: %d, Elapsed time: %f\n",
            b.m_number, double(end - begin) / CLOCKS_PER_SEC);
    return 0;
}

class.h:

#include <iostream>
struct C {
public:
    int m_number;
    C(int number);
    void operator+=(const C & rhs);
};

class.cpp:

#include "class.h"

C::C(int number)
: m_number(number)
{
}

void
C::operator+=(const C & rhs) {
    m_number += rhs.m_number;
}

Files are compiled using clang++ with flags -std=c++11 -O3.

I expected very similar performance from both loops, since I thought the compiler would optimize the operator so that it is not called as a function. The reality was a bit different; here is the result:

Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 5.375751

I played around a bit and found that if I paste all of the code from class.* into main.cpp, the speed improves dramatically and the two results are very similar:

Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 0.000003

Then I realized that this behavior is probably caused by the fact that main.cpp and class.cpp are compiled completely separately, so the compiler is unable to perform adequate optimizations.

My question: Is there any way to keep the 3-file scheme and still achieve the same level of optimization as if the files were merged into one and then compiled? I have read something about 'unity builds', but that seems like overkill.

People who use `endl` have no right to complain about performance. — nwp, Jul 24 '14 at 10:03
You should really be calling `clock()` straight after the loops, not inside of `cout <<`. That would make your tests more meaningful. — juanchopanza, Jul 24 '14 at 10:04
In general, using `printf` will be faster. And some compilers can do link-time-code-generation. It's opt-in though. — Deduplicator, Jul 24 '14 at 10:04
Well, originally I had it like you say, but then, to keep the example code short, I put it right into the cout — Jendas, Jul 24 '14 at 10:06
Well, you should change it back. As it is, it is just causing confusion. — juanchopanza, Jul 24 '14 at 10:08
Have you checked the generated code? The first loop looks simple enough to be optimized out completely, and I recall a benchmark from the 90s where gcc beat the Microsoft compiler in a benchmark by several orders of magnitude. It turned out gcc replaced a value calculated inside a loop by a simple assignment. (If int were big enough, the result for your example would be 999999999*1000000000/2.) EDIT: answer was off-by-one. — Axel, Jul 24 '14 at 10:09
@Axel: It's not about the performance of either loop, but their difference, when full optimization is done. — Deduplicator, Jul 24 '14 at 10:11
@Deduplicator I agree, it doesn't really matter what clang++ does with the loop, the point is, that it is unable to do it with the class in other file. — Jendas, Jul 24 '14 at 10:13
Could you elaborate about the "3-file scheme"? Depending on the real constraints, another possibility would be to define the functions not in the class.cpp, but in the class.h header (or perhaps a sub-header class.inline.h included by class.h, thus you keep three separate files), and define them as inline... The downside is lesser encapsulation (as everyone can see the code of the function). The upside is that optimization is easier, even without whole program/link time optimization. — paercebal, Jul 24 '14 at 11:23
@Deduplicator Yes, but what I meant was: I am not sure that the executable contains a loop for the first one at all (that optimization was possible 20 years ago), and that would render the whole comparison useless. — Axel, Jul 24 '14 at 12:17

score 19 · Accepted Answer · edited May 23 '17 at 12:23

Short answer

What you want is link-time optimization. Try the approach from the answer to this question, i.e.:

clang++ -O4 -emit-llvm main.cpp -c -o main.bc 
clang++ -O4 -emit-llvm class.cpp -c -o class.bc 
llvm-link main.bc class.bc -o all.bc
opt -std-compile-opts -std-link-opts -O3 all.bc -o optimized.bc
clang++ optimized.bc -o yourExecutable

You should see the performance match what you got when pasting everything into main.cpp.

Long answer

The problem is that main.cpp and class.cpp are compiled separately. While compiling main.cpp, the compiler sees only the declaration of your overloaded operator, not its body, so it cannot inline it. The linker cannot fix this afterwards, because by then the definition exists only as machine code, which is not a form suitable for inlining. Thus, the operator call in main.cpp stays a real function call to the function defined in class.cpp. Compared with a simple inlined addition, which can be optimized further (e.g., vectorized or, as in your first loop, removed entirely), a billion real function calls are expensive.
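The header-only alternative mentioned by paercebal in the comments on the question can be sketched as follows. This is a minimal illustration rather than the asker's actual code: the struct would live in class.h, the member functions are defined inside the class body (which makes them implicitly inline), unsigned is used so that wrap-around is well defined, and sum_with_operator is a hypothetical helper added only for this example.

```cpp
// Sketch of a header-only C (what class.h could contain).
struct C {
    unsigned m_number;

    // Defined in the class body, hence implicitly inline: every translation
    // unit that includes the header sees the body and may inline the call.
    explicit C(unsigned number) : m_number(number) {}

    // Returns *this so that chained expressions such as a += b += c work.
    C& operator+=(const C& rhs) {
        m_number += rhs.m_number;
        return *this;
    }
};

// Hypothetical helper (not in the original): sums 0..n-1 via the operator.
unsigned sum_with_operator(unsigned n) {
    C acc(0);
    for (unsigned i = 0; i < n; ++i) {
        acc += C(i);
    }
    return acc.m_number;
}
```

With the body visible in every translation unit, the compiler is free to inline the operator without any link-time optimization. The trade-off, as noted in that comment, is weaker encapsulation, since the implementation is exposed in the header.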

When you enable link-time optimization, the compiler is able to do this. As you see above, you first create LLVM intermediate representation bitcode (the .bc files, which I will simply call LLVM code from here on) instead of machine code. You then link these files into a new .bc file, which still contains LLVM code rather than machine code. In contrast to machine code, LLVM code can be inlined. opt is the LLVM optimizer (be sure to install LLVM), which performs the inlining and further link-time optimizations. Finally, we call clang++ one last time to generate executable machine code from the optimized LLVM code.

For People with GCC

The commands above are only for clang. GCC (g++) users must pass the -flto flag both when compiling and when linking to enable link-time optimization. This is simpler than the clang commands above; just add -flto everywhere:

g++ -c -O2 -flto main.cpp
g++ -c -O2 -flto class.cpp
g++ -o myprog -flto -O2 main.o class.o

@Deduplicator: You mean that it should not return `void` so it can be used in an inner expression like `a+=b+=c`? I think that would clutter the answer with off-topic content. However, I leave this comment, so it can be read here, at least :). — gexicide, Jul 24 '14 at 10:19
Note that Visual C++ also offers this optimization as the "Whole Program Optimization": http://msdn.microsoft.com/en-us//library/0zza0de8.aspx — paercebal, Jul 24 '14 at 11:17

score 2 · Answer 2 · answered Jul 24 '14 at 10:04

The technique you are looking for is called Link Time Optimization.

erenon


score 0 · Answer 3 · answered Jul 24 '14 at 15:55

From the timing data, it is obvious that the compiler doesn't just generate better code for the trivial case: it doesn't execute any code at all to sum up a billion numbers. That doesn't happen in real life, so this is not a useful benchmark. You want to test code that is at least complicated enough to avoid stupid/clever things like this.

I'd re-run the test, but change the loop to

for (int i = 0; i < 1000000000; ++i) if (i != 1000000) {
    // ... 
}

so that the compiler is forced to actually add up the numbers.
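Another way to keep the additions from being folded away, offered here as an assumption-laden sketch rather than as what this answer proposes, is to accumulate into a volatile variable. Reads and writes of a volatile object count as observable behavior, so the compiler has to perform them on every iteration. The helpers sum_volatile and seconds_for are hypothetical names introduced only for this example.

```cpp
#include <ctime>

// Hypothetical helper: sums 0..n-1 into a volatile accumulator. Each access
// to acc must really happen, so the loop cannot be replaced by a constant.
unsigned sum_volatile(unsigned n) {
    volatile unsigned acc = 0;
    for (unsigned i = 0; i < n; ++i) {
        acc = acc + i;  // one volatile read and one volatile write
    }
    return acc;
}

// Hypothetical helper: times one call of sum_volatile with clock(), reading
// the clock immediately before and after the loop, and returns the seconds.
double seconds_for(unsigned n, unsigned& result) {
    std::clock_t begin = std::clock();
    result = sum_volatile(n);
    std::clock_t end = std::clock();
    return double(end - begin) / CLOCKS_PER_SEC;
}
```

Note that the volatile accesses add memory traffic of their own, so this measures a loop that actually runs rather than the cost of the addition alone. The same accumulator pattern could be applied to the C version of the loop to compare the two on equal terms.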
