I would probably stick with valgrind/callgrind:
Trying out the compile flags -mavx and -mfma4 causes issues for me too on different processors. FMA4 is an AMD extension (Intel chips implement FMA3 instead), whereas AVX originated at Intel and has since been adopted by AMD. In benchmarks, however, AVX on AMD, when supported, actually performs slower than using SSE1/2/3/4. FMA4 fills in for part of what was planned as SSE5.
Using both optimisations is perhaps not the best approach and may well lead to the behaviour you are experiencing, as they effectively stand in opposition to each other, being primarily designed for specific brands of processors. Try removing FMA4 if you are compiling for an Intel CPU that supports AVX, and use FMA4 only if you are compiling for an AMD processor that supports it.
That having been said, a compiler that is being strict about IEEE floating point will not combine a multiply and an add into an FMA, because that would reduce two roundings to one. Hence you would need to use a relaxed floating-point model (something like -ffast-math), or give up IEEE floating-point compliance, to have a multiply and add converted to an FMA. Not sure how it works when you call the intrinsics specifically, but the compiler might not optimise them based on flags as they are very specific instructions.
The FMA flag (-mfma4) on my Intel CPUs reliably produces the same result, with valgrind throwing hissy fits similar to the one you have posted. It behaves fine on the AMD machines, however (I take it your processor is an Intel?):
vex amd64->IR: unhandled instruction bytes: 0xC4 0x43 0x19 0x6B 0xE5 0xE0 0xF2 0x44
vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0xC ESC=0F3A
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
This is from the test code below.
FMA3 Intrinsics: (AVX2 - Intel Haswell)
_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()
and many many more besides....
FMA4 Intrinsics: (XOP - AMD Bulldozer)
_mm_macc_pd(), _mm256_macc_pd()
_mm_macc_ps(), _mm256_macc_ps()
and many many more besides....
Notes
The features that were originally scheduled to be part of SSE5 ended up split across:
XOP: Integer vector multiply–accumulate instructions, integer vector horizontal addition, integer vector compare, shift and rotate instructions, byte permutation and conditional move instructions, floating point fraction extraction.
FMA4: Floating-point vector multiply–accumulate.
F16C: Half-precision floating-point conversion.
Test Code
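#include <stdio.h>

/* Plain multiply-add; with -mfma4 (or -mfma) the compiler may emit a fused instruction here. */
float vfmaddsd_func(float f1, float f2, float f3){
    return f1*f2 + f3;
}

int main() {
    float f1,f2,f3;
    f1 = 1.1;
    f2 = 2.2;
    f3 = 3.3;
    float f4 = vfmaddsd_func(f1,f2,f3);
    printf("%f\n", f4);
    return 0;
}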