I'm trying to compare the performance of FMA (fma() from math.h) against naive multiplication and addition in floating-point code. The test is simple: I iterate the same calculation a large number of times. For a precise measurement, a few things have to hold:
- No computation other than the operation under test should be counted in the timing.
- The naive multiplication and addition must not be optimized into an FMA (a small numeric check for this is sketched right after this list).
- The loop must not be optimized away, i.e. the iterations should be carried out exactly as many times as I intend.
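As a side check for the second point: since fma() skips the intermediate rounding step, I believe one can test numerically whether a * b + c was contracted into an FMA. A minimal sketch (the constants are only chosen so that the intermediate rounding is visible):

#include <cmath>
#include <cstdio>

int main() {
    // a * b is exactly 1 - 2^-60, which rounds to 1.0 in double precision.
    double a = 1.0 + std::ldexp(1.0, -30);   // 1 + 2^-30
    double b = 1.0 - std::ldexp(1.0, -30);   // 1 - 2^-30
    double c = -1.0;

    // Prints 0 if the product was rounded before the add (no contraction),
    // about -8.67e-19 if the compiler contracted the expression into an FMA.
    std::printf("a*b + c      = %g\n", a * b + c);

    // std::fma never rounds the intermediate product, so this is ~-8.67e-19.
    std::printf("fma(a, b, c) = %g\n", std::fma(a, b, c));
}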
To achieve the above, I did the following:
- The functions are inline and contain only the required computation.
- Used the g++ -O0 option so that the multiplication is not optimized (but when I look into the dump file, it seems to generate almost the same code for both).
- Used volatile operands (a variation on this, storing the result to a volatile sink, is sketched right after this list).
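In case it matters, a variation I could also try is to store the result into a volatile sink instead of discarding it (the sink variable exists only for this sketch), so the computation and the store cannot be removed even if optimization is enabled:

#include <cmath>

volatile double a, b, c;   // volatile operands, as in my code below
volatile double sink;      // illustration only: forces the result to be stored

inline void pure_fma_func() {
    sink = std::fma(a, b, c);
}

inline void non_fma_func() {
    sink = a * b + c;
}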
But the results show almost no difference, or fma() is even slower than naive multiplication and addition. Is this the expected result (i.e. they really are not different in terms of speed), or am I doing something wrong?
Spec
- Ubuntu 14.04.2
- G++ 4.8.2
- Intel(R) Core(TM) i7-4770 (3.4GHz, 8MB L3 cache)
My Code
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <chrono>

using namespace std;
using namespace chrono;

inline double rand_gen() {
    return static_cast<double>(rand()) / RAND_MAX;
}

// volatile operands, as described above
volatile double a, b, c;

inline void pure_fma_func() {
    fma(a, b, c);
}

inline void non_fma_func() {
    a * b + c;
}

int main() {
    int n = 100000000;
    a = rand_gen();
    b = rand_gen();
    c = rand_gen();

    auto t1 = system_clock::now();
    // timed loop: naive multiply-add
    for (int i = 0; i < n; i++) {
        non_fma_func();
    }
    auto t2 = system_clock::now();

    // timed loop: fma()
    for (int i = 0; i < n; i++) {
        pure_fma_func();
    }
    auto t3 = system_clock::now();

    cout << "non fma" << endl;
    cout << duration_cast<microseconds>(t2 - t1).count() / 1000.0 << "ms" << endl;
    cout << "fma" << endl;
    cout << duration_cast<microseconds>(t3 - t2).count() / 1000.0 << "ms" << endl;
}