FMA performance compared to naive calculation

Question

I'm trying to compare the performance of FMA (fma() in math.h) against a naive multiplication and addition in floating-point computation. The test is simple: I repeat the same calculation for a large number of iterations. There are three things I have to achieve for a precise measurement.

No other computation should be included in the timed region.
The naive multiplication and addition should not be optimized into an FMA.
The iteration should not be optimized away, i.e. the loop should run exactly as many times as I intended.

To achieve the above, I did the following:

The functions are inline and contain only the required computation.
Used the g++ -O0 option so that the multiplication is not optimized. (But when I look at the dump file, it seems to generate almost the same code for both.)
Used volatile.

But the results show almost no difference, or fma() is even slower than the naive multiplication and addition. Is this the expected result (i.e. they are not really different in terms of speed), or am I doing something wrong?

Spec

Ubuntu 14.04.2
G++ 4.8.2
Intel(R) Core(TM) i7-4770 (3.4GHz, 8MB L3 cache)

My Code

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
using namespace chrono;

inline double rand_gen() {
    return static_cast<double>(rand()) / RAND_MAX;
}

volatile double a, b, c;
inline void pure_fma_func() {
    fma(a, b, c);
}
inline void non_fma_func() {
    a * b + c;
}


int main() {
    int n = 100000000;

    a = rand_gen();
    b = rand_gen();
    c = rand_gen();

    auto t1 = system_clock::now();
    for (int i = 0; i < n; i++) {
        non_fma_func();
    }
    auto t2 = system_clock::now();
    for (int i = 0; i < n; i++) {
        pure_fma_func();
    }
    auto t3 = system_clock::now();

    cout << "non fma" << endl;
    cout << duration_cast<microseconds>(t2 - t1).count() / 1000.0 << "ms" << endl;
    cout << "fma" << endl;
    cout << duration_cast<microseconds>(t3 - t2).count() / 1000.0 << "ms" << endl;
}

I compiled with: `g++ test.cpp -mfma -O0 -o test` and the result shows about 250ms for both. - Jongbin Park, Mar 23 '15 at 19:42
Compare the assembler output from `-O0` with `-O2` or `-O3` and see how much junk is removed, especially jumps, loads and stores, which can be expensive. You kinda bloated the test. - luk32, Mar 23 '15 at 19:57

score 8 · Accepted Answer · answered Mar 23 '15 at 19:44

Yes, you are doing something completely wrong. At least two somethings. But let's keep it simple.

Used g++ -O0 option not to optimize the multiplication

This renders your whole results completely irrelevant. Fun fact: the cost of the function call is probably more than the cost of the calculation in either case.

Fundamentally, the results of benchmarks without optimizations enabled are completely meaningless. You can't just turn them off and hope for the best. They absolutely must be enabled.

Secondly, FMA vs. regular multiply-and-add is a complex situation: there are things like latency vs. throughput and other matters where multiply-and-add can be a winner.
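To illustrate the latency-versus-throughput point, here is a minimal sketch. It is not from the original post, and the function names dot_one_chain and dot_four_chains are hypothetical. The first version forms a single dependency chain through the accumulator, so each fma has to wait for the previous one and the loop tends to be bound by instruction latency. The second keeps four independent accumulators, so several operations can be in flight at once and the loop tends to be bound by throughput instead. Whether FMA or a separate multiply and add comes out ahead can differ between these two shapes, depending on the microarchitecture.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product with ONE accumulator.
// Every fma depends on the previous value of acc, so the loop is
// limited by the latency of a single fma.
double dot_one_chain(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());  // assumption: equal-length inputs
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc = std::fma(a[i], b[i], acc);
    }
    return acc;
}

// Hypothetical helper: dot product with FOUR independent accumulators.
// The four chains do not depend on each other, so the CPU can overlap
// them; the loop is then limited more by throughput than by latency.
double dot_four_chains(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());  // assumption: equal-length inputs
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; ++k) {
            acc[k] = std::fma(a[i + k], b[i + k], acc[k]);
        }
    }
    // Handle the leftover elements when n is not a multiple of four.
    for (; i < n; ++i) {
        acc[0] = std::fma(a[i], b[i], acc[0]);
    }
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Because the two versions add the terms in a different order, their results can differ in the last bits for general inputs; they agree exactly only when every intermediate value is exactly representable.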

In short, your benchmark is not a benchmark at all, it's just a bunch of random instructions that produce meaningless junk.

If you want an accurate benchmark, you must accurately reproduce the actual circumstances of use, entirely: the surrounding code, the compiler optimizations, the whole shebang.
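As a rough starting point only, here is one way the measurement could be restructured so that it survives compiler optimization. This is a sketch under stated assumptions, not a definitive benchmark: the names chain_mul_add, chain_fma and time_ms are hypothetical, and a loop like this still does not reproduce the surrounding code of a real application. The idea is to feed each result into the next iteration, so the optimizer cannot drop the work, and to hand the final value back to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical helper: n dependent steps of a separate multiply and add.
// Each result feeds the next iteration, so the loop cannot be removed
// as long as the caller uses the returned value.
double chain_mul_add(double x, double b, double c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        x = x * b + c;
    }
    return x;
}

// Hypothetical helper: the same chain written with std::fma.
double chain_fma(double x, double b, double c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        x = std::fma(x, b, c);
    }
    return x;
}

// Hypothetical helper: time one call of f in milliseconds and store its
// result in sink. The caller should print or otherwise use sink so the
// computation counts as observable.
template <class F>
double time_ms(F&& f, double& sink) {
    const auto t0 = std::chrono::steady_clock::now();
    sink = f();
    const auto t1 = std::chrono::steady_clock::now();
    assert(t1 >= t0);  // steady_clock is monotonic, unlike system_clock
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

With GCC, one plausible way to build this is `g++ -O2 -mfma -ffp-contract=off`. The last flag is there because GCC may otherwise contract `x * b + c` into an FMA on its own once `-mfma` is enabled, which would make the two loops identical. The inputs should come from run time (for example from rand_gen() in the question), not from literals, and each measurement is best repeated several times.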
