
I am writing a few functions for distributions and used the normal distribution to run tests comparing my implementation against C++ Boost.

Given the probability density function (pdf: http://www.mathworks.com/help/stats/normpdf.html),

which I wrote like this:

double NormalDistribution1D::prob(double x) {
    return (1 / (sigma * (std::sqrt(boost::math::constants::pi<double>() * 2))))
           * std::exp((-1 / 2) * (((x - mu) / sigma) * ((x - mu) / sigma)));
}

To compare my results, this is how it is done with C++ Boost:

    boost::math::normal_distribution <> d(mu, sigma);
    return boost::math::pdf(d, x);

I was not exactly positively surprised: my version took 44278 nanoseconds, Boost only 326.

So I played around a bit, wrote the method probboost in my NormalDistribution1D class, and compared all three:

void MATTest::runNormalDistribution1DTest1() {
    double mu = 0;
    double sigma = 1;
    double x = 0;

    std::chrono::high_resolution_clock::time_point tn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
    double nres = n->prob(x);
    std::chrono::high_resolution_clock::time_point tn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point tdn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n1 = new NormalDistribution1D(mu, sigma);
    double nres1 = n1->probboost(x);
    std::chrono::high_resolution_clock::time_point tdn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point td_start = std::chrono::high_resolution_clock::now();
    boost::math::normal_distribution <> d(mu, sigma);
    double dres = boost::math::pdf(d, x);
    std::chrono::high_resolution_clock::time_point td_end = std::chrono::high_resolution_clock::now();

    std::cout << "Mu : " << mu << "; Sigma: " << sigma << "; x: " << x << std::endl;
    if (nres == dres) {
        std::cout << "Result" << nres << std::endl;
    } else {
        std::cout << "\033[1;31mRes incorrect: " << nres << "; Correct: " << dres << "\033[0m" << std::endl;
    }


    auto duration_n = std::chrono::duration_cast<std::chrono::nanoseconds>(tn_end - tn_start).count();
    auto duration_d = std::chrono::duration_cast<std::chrono::nanoseconds>(td_end - td_start).count();
    auto duration_dn = std::chrono::duration_cast<std::chrono::nanoseconds>(tdn_end - tdn_start).count();

    std::cout << "own boost: " << duration_dn << std::endl;
    if (duration_n < duration_d) {
        std::cout << "Boost: " << (duration_d) << "; own implementation: " << duration_n << std::endl;
    } else {
        std::cout << "\033[1;31mBoost faster: " << (duration_d) << "; than own implementation: " << duration_n << "\033[0m" << std::endl;
    }
}

The results were (I compiled and ran the checking method 3 times):

own boost: 1082 Boost faster: 326; than own implementation: 44278

own boost: 774 Boost faster: 216; than own implementation: 34291

own boost: 769 Boost faster: 230; than own implementation: 33456

Now this puzzles me: how is it possible that the method from the class takes 3 times longer than the same statements called directly?

My compiling options:

g++ -O2   -c -g -std=c++11 -MMD -MP -MF "build/Debug/GNU-Linux-x86/main.o.d" -o build/Debug/GNU-Linux-x86/main.o main.cpp

g++ -O2    -o ***Classes***
Qohelet
  • Calculating `(x - mu) / sigma` is wasteful. `1 / 2` is `.5`. Are you measuring code compiled with optimization enabled? Timing is more reliable if you perform the calculation in a loop many times. Aim to run for a few seconds at least and divide the total time by the iteration count. – David Heffernan Aug 05 '15 at 20:11
  • `(x - mu ) / sigma` with those values was just the first test. But no, I don't know much about optimization in c++ (I'm a Java-Developer) – Qohelet Aug 05 '15 at 20:15
  • Sorry, I meant calculating `(x - mu ) / sigma` twice was wasteful. You now need to find out whether you are compiling optimize code. Only you can know that. We can't see your compiler options. – David Heffernan Aug 05 '15 at 20:17
  • How would you recommend me to write `(x - mu) / sigma`? I read a bit general stuff about optimisation and, now the result is: Boost faster: 48; than own implementation: 46248 (for better reading purposes I wrote the options in the question) – Qohelet Aug 05 '15 at 20:21
  • I recommend calculating that value once not twice – David Heffernan Aug 05 '15 at 20:24
  • Sorry, got it. Anyway, there's not much difference now: `double musigma = (x - mu) / sigma; return (1 / (sigma * (std::sqrt(boost::math::constants::pi<double>()*2))))*std::exp(-0.5*(musigma*musigma));` Boost faster: 78; than own implementation: 46570 – Qohelet Aug 05 '15 at 20:29
  • Is boost special casing this? Try using more general values of mu, sigma and x. It probably also does not call sqrt(2pi), having that value as a const. – David Heffernan Aug 05 '15 at 20:33
  • You might be right about that. But the question was mostly why boost is directly three times faster than called within the class: `Mu : 0; Sigma: 1; x0 Result0.398942 Boost faster: 35; than own implementation: 37889 Mu : 122; Sigma: 50; x2.5 Result0.000458747 Boost faster: 35; than own implementation: 5290 Mu : 5520; Sigma: 599; x4529.5 Result0.000169717 Boost faster: 57; than own implementation: 717` – Qohelet Aug 05 '15 at 20:44
  • Oh. I thought you wanted to know why your code was 100x slower. Never mind. – David Heffernan Aug 05 '15 at 20:48
  • Nevermind, the outcome of your ideas was interesting to see. In the end My code became strangely 50 times faster while boost slowed down by two times – Qohelet Aug 05 '15 at 20:50

2 Answers


First of all, you are allocating your object dynamically, with new:

NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
double nres = n->prob(x);

If you did it just like you did with Boost, that alone would be sufficient to get the same (or comparable) speed:

NormalDistribution1D n(mu, sigma);
double nres = n.prob(x);

Now, I don't know if the way you spelled out your expression in NormalDistribution1D::prob() is significant, but I am skeptical that writing it in a more "optimized" way would make any difference, because arithmetic expressions like that are exactly the kind of thing compilers optimize very well. Maybe it will get faster if you use the -ffast-math switch, which gives the compiler more optimization freedom.

Also, if the definition of double NormalDistribution1D::prob(double x) is in another compilation unit (another .cpp file), the compiler will be unable to inline it, and that also adds noticeable overhead (maybe making it twice as slow or so). In Boost, almost everything is implemented inside headers, so inlining happens whenever the compiler sees fit. You can overcome this problem by compiling and linking with GCC's -flto switch.
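To illustrate both points together, here is a minimal sketch (with a hypothetical class name, not the asker's actual code): the object lives on the stack, the definition is visible in the header so it can be inlined, and the constant 1/(sigma*sqrt(2*pi)) factor is hoisted into the constructor. Note the explicit -0.5, since `-1 / 2` is integer division and evaluates to 0.

```cpp
#include <cmath>

// Hypothetical header-only pdf class: the member function is defined
// in the class body, so the compiler can inline calls to it; the
// normalization factor and 1/sigma are precomputed once.
class NormalPdf {
public:
    NormalPdf(double mu, double sigma)
        : mu_(mu),
          inv_sigma_(1.0 / sigma),
          norm_(1.0 / (sigma * std::sqrt(2.0 * 3.14159265358979323846))) {}

    double operator()(double x) const {
        const double z = (x - mu_) * inv_sigma_;  // standardized value, computed once
        return norm_ * std::exp(-0.5 * z * z);    // -0.5, not the integer expression -1 / 2
    }

private:
    double mu_, inv_sigma_, norm_;
};
```

Usage would be simply `NormalPdf p(mu, sigma); double res = p(x);`, with no `new` involved.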

lvella
  • For me as a Java-Person this is a huge surprise to see, much thanks for telling me it took much more time. Now we are 195:6437 and 112:171. With the headers I can't do much about it, I have to separate. Anyway, the new performance-differences are amazing – Qohelet Aug 05 '15 at 21:02
  • Have you tried to compile with -ffast-math and -flto ? – lvella Aug 05 '15 at 21:21
  • Both of them. And I'm positively surprised by the results. Even though the performance boost of almost 90% came from the removal of `new`. Many thanks, I didn't know that – Qohelet Aug 06 '15 at 07:01

You didn't compile with the -ffast-math option. That means the compiler cannot (in fact, must not!) simplify `(-1 / 2)*(((x - mu) / sigma)*((x - mu) / sigma))` to a form similar to that used in `boost::math::pdf` (note also that `-1 / 2` is integer division and evaluates to 0; write `-0.5` to get the intended factor):

expo = (x - mu) / sigma
expo *= -expo
expo /= 2
result = std::exp(expo)
result /= sigma * std::sqrt(2 * boost::math::constants::pi<double>())

The above forces the compiler to do the fast (but potentially unsafe / inaccurate) calculation without resorting to -ffast-math.

Secondly, the time difference between the above and your code is minimal compared to the time needed to allocate from the heap (new) versus the stack (local variable). You are mostly timing the cost of dynamic memory allocation.

David Hammen