C++11 in conjunction with OpenMP gives slower executable

Question

I am trying to learn OpenMP and want to study speed-up using OpenMP. For this purpose, I have written the following small program:

#include <vector>
#include <cmath>

int main() {
    static const unsigned int testDataSize = 1 << 28;

    std::vector<double> a (testDataSize), b (testDataSize);

    for (int i = 0; i < testDataSize; ++i) {
        a [i] = static_cast<double> (23 ^ i) / 1000.0;
    }
    b.resize(testDataSize);

    #pragma omp parallel for
    for (int i = 0; i < testDataSize; ++i) {
        b [i] = std::pow(a[i], 3) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 5) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 7) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 9) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 11) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 13) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 15) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 17) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 19) * std::exp(-a[i] * a[i]);
        b [i] += std::pow(a[i], 21) * std::exp(-a[i] * a[i]);
    }

    return 0;
}

I compiled the above code both with and without the -std=c++11 flag. What I notice is that with -std=c++11 my code runs roughly 6 times slower than without it. I am using -O3 and gcc version 4.9.2 on a Debian Linux system. Furthermore, when I compare the execution times without OpenMP, I see a similar speed difference. It therefore looks to me as if the problem lies with -std=c++11 and not with OpenMP.

In detail, I obtain the following execution times (as measured with the Linux time command):

Compilation with OpenMP and -std=c++11: 35.262s

Compilation only with OpenMP: 5.875s

Compilation with only -std=c++11: 2m12

Compilation without OpenMP and -std=c++11: 23.757s

What is the reason that the execution time is much slower when using -std=c++11?

Any help or suggestion is greatly appreciated!

I have accepted what, in my humble opinion, is the best answer. Following up on oLen's answer, I have written my own pow(double, int) function, given below:

double my_pow(double base, int exp) {
    double result = 1.0;

    while (exp) {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        base *= base;
    }

    return result;
}

I am not sure whether this is the most efficient way to compute an integer power of a base, but with this function I get the same run times whether or not I compile with -std=c++11, which is fully in line with oLen's answer.
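Note that the loop in my_pow assumes a non-negative exponent: with a negative exp, the right shift typically never reaches zero. A minimal sketch of a variant that also accepts negative exponents is shown here; the name integer_pow and the zero-base precondition are assumptions made for illustration and are not part of the original post.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the original post): integer power by
// repeated squaring that also accepts negative exponents.
double integer_pow(double base, int exp) {
    // Assumed precondition: 0 raised to a negative power is rejected here
    // instead of silently dividing by zero.
    assert(!(base == 0.0 && exp < 0));

    // Work on the magnitude in a wider type so that negating INT_MIN is safe.
    long long n = exp;
    if (n < 0) n = -n;

    double result = 1.0;
    while (n) {
        if (n & 1)          // current lowest bit set: multiply this power in
            result *= base;
        n >>= 1;            // move on to the next bit of the exponent
        base *= base;       // base now holds the next squared power
    }

    // A negative exponent is the reciprocal of the positive power.
    return exp < 0 ? 1.0 / result : result;
}
```

For the exponents used in the question (3 to 21) this should behave like my_pow; the extra branch only matters for negative exponents.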

What is your question? "Why is c++11 so much slower?". What options are you using? What does the assembly look like? — Martin Bonner supports Monica, Dec 22 '15 at 15:59
@erip: Indeed; thanks for the correction. I have changed script into program in the question above. — Ivo Filot, Dec 22 '15 at 17:22

oLen · Accepted Answer · 2015-12-22T16:42:37.550

9

The reason is that the version without -std=c++11 uses std::pow(double, int), which is apparently not available in C++11 and which is faster than std::pow(double, double). If you replace the integer exponents (3, 5, etc.) with doubles (3.0, 5.0, etc.), both builds run at the same (slower) speed.

EDIT: Here are my timings with g++ version 4.8.4:
Original version:
-O3 -fopenmp : 10.678 s
-O3 -fopenmp -std=c++11 : 36.994 s
Adding ".0" after the integers:
-O3 -fopenmp : 36.679 s
-O3 -fopenmp -std=c++11 : 36.938 s

edited Dec 22 '15 at 16:42

answered Dec 22 '15 at 16:00


@LightnessRacesinOrbit I updated the answer with the timings I got – oLen Dec 22 '15 at 16:43
Can you suggest a version that gives the lower figure for both compile modes? – Lightness Races in Orbit Dec 22 '15 at 17:02
@LightnessRacesinOrbit: For this particular code, it's trivially easy to get rid of the calls to `pow` entirely. `accum += power; power *= a[i] * a[i];` – Ben Voigt Dec 22 '15 at 17:06
@BenVoigt: I am offering the answer author the opportunity to expand their answer to provide the best approach. That may be the best approach. But it's for the OP in the answer, not for me in the comments. – Lightness Races in Orbit Dec 22 '15 at 17:08
@oLen: Thanks a lot for the clear answer! I never expected it to be such a 'silly' thing like that C++11 does not have a pow(double, int) version... – Ivo Filot Dec 22 '15 at 17:24
I cannot think of a solution to get the faster version with C++11. There is a small discussion about that in http://stackoverflow.com/questions/5627030/why-was-stdpowdouble-int-removed-from-c11 but there doesn't seem to be a solution – oLen Dec 22 '15 at 17:49
`double(*fptr)(double, int) = std::pow; fptr(3.14,2);`? – Yakk - Adam Nevraumont Dec 23 '15 at 15:41

Ben Voigt · Answer 2 · 2015-12-22T16:34:39.107

In addition to the function overload selection issue @oLen pointed out, you have false sharing, which is hurting parallelism. Don't access the array element in every statement: it sits in memory directly adjacent to elements being modified by other threads, which causes thrashing in the cache coherency protocol. Instead, accumulate the result in a temporary and write to the result array only once:

for (int i = 0; i < testDataSize; ++i) {
    double accum = std::pow(a[i], 3) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 5) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 7) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 9) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 11) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 13) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 15) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 17) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 19) * std::exp(-a[i] * a[i]);
    accum += std::pow(a[i], 21) * std::exp(-a[i] * a[i]);
    b[i] = accum;
}

For that matter, calling std::exp(-a[i] * a[i]) only once and saving the result should help even the single-threaded case, since it's very difficult for the compiler to prove this common subexpression can be optimized. And best of all, factor that out of the entire calculation:

for (int i = 0; i < testDataSize; ++i) {
    double accum = std::pow(a[i], 3);
    accum += std::pow(a[i], 5);
    accum += std::pow(a[i], 7);
    accum += std::pow(a[i], 9);
    accum += std::pow(a[i], 11);
    accum += std::pow(a[i], 13);
    accum += std::pow(a[i], 15);
    accum += std::pow(a[i], 17);
    accum += std::pow(a[i], 19);
    accum += std::pow(a[i], 21);
    b[i] = accum * std::exp(-a[i] * a[i]);
}
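Ben Voigt's comment under the accepted answer (`accum += power; power *= a[i] * a[i];`) goes one step further and drops the calls to pow entirely. A minimal sketch of that idea follows, assuming the same odd exponents 3 to 21 as in the question; the function name odd_power_sum and its signature are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: b[i] = (x^3 + x^5 + ... + x^21) * exp(-x^2), x = a[i].
void odd_power_sum(const std::vector<double>& a, std::vector<double>& b) {
    assert(a.size() == b.size());

    for (std::size_t i = 0; i < a.size(); ++i) {
        const double x = a[i];
        const double x2 = x * x;

        double power = x2 * x;  // starts at x^3
        double accum = 0.0;
        for (int k = 0; k < 10; ++k) {  // ten terms: exponents 3, 5, ..., 21
            accum += power;
            power *= x2;        // step to the next odd power
        }

        // exp is evaluated once per element and the result is written once.
        b[i] = accum * std::exp(-x2);
    }
}
```

Since no overload of pow is involved, this version should not depend on whether -std=c++11 is used. To parallelize it as in the question, the `#pragma omp parallel for` line would go above the outer loop, with a signed loop index if the OpenMP version in use requires one.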

@erip: `a[i]` is only being read, not written, so the false sharing doesn't cause problems (all cores can have the cache line in state `S` -- shared, read-only) — Ben Voigt, Dec 22 '15 at 16:35

score 3 · Answer 3 · answered Dec 22 '15 at 19:58

On top of the excellent answer by @oLen: a quick check shows that in previous versions of libstdc++, pow(double, int) was just a thunk to __builtin_powi(double, int), which computes the power via multiplication. It turned out that, in general, pow(double, int) and pow(double, double(int)) cannot be made to produce the same result, so to follow the standard, the C++11 library implementation was changed to use pow(double, double), with a conversion involved when the second argument is an int. The GCC documentation was changed as well, and it now states:

— Built-in Function: double __builtin_powi (double, int)
    Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.

Link: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
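As a rough illustration of the builtin quoted above, the following sketch calls __builtin_powi directly when a GCC-compatible compiler is detected and falls back to std::pow otherwise. The wrapper names fast_powi and relative_gap are hypothetical, and whether the builtin is actually faster on a given toolchain is something to measure rather than assume.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical wrapper: use the GCC builtin where available, std::pow otherwise.
double fast_powi(double base, int exp) {
#if defined(__GNUC__)
    // Per the GCC documentation, no precision or rounding guarantees are made.
    return __builtin_powi(base, exp);
#else
    return std::pow(base, static_cast<double>(exp));
#endif
}

// Relative difference between fast_powi and std::pow(double, double).
double relative_gap(double base, int exp) {
    const double reference = std::pow(base, static_cast<double>(exp));
    assert(reference != 0.0);  // the relative gap is undefined for a zero reference
    return std::fabs(fast_powi(base, exp) - reference) / std::fabs(reference);
}
```

A call such as relative_gap(a[i], 21) gives a quick way to check how far the two results drift apart for the inputs you care about before swapping one for the other.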
