In the BSD Library Functions Manual of FMA(3), it says "These functions compute x * y + z."
So what's the difference between FMA and naive code that does x * y + z? And why does FMA have better performance in most cases?
[ I don't have enough karma to make a comment; adding another answer seems to be the only possibility. ]
Eric's answer covers everything well, but a caveat: there are times when using fma(a, b, c) in place of a*b+c can cause difficult-to-diagnose problems.

Consider

    x = sqrt(a*a - b*b);

If it is replaced by

    x = sqrt(fma(a, a, -b*b));

there are values of a and b for which the argument to the sqrt function may be negative even if |a| >= |b|. In particular, this will occur if |a| = |b| and the infinitely precise product a*a is less than the rounded value of a*a. This follows from the fact that the rounding error in computing a*a is given by fma(a, a, -a*a).
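A minimal sketch of how this can bite in practice (assuming IEEE-754 double and the standard <math.h> fma): the loop below searches random values for an a whose rounding error fma(a, a, -a*a) is negative. For such an a, taking b equal to a, the "fused" form of sqrt(a*a - b*b) produces a NaN while the naive form returns 0.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        srand(1);
        for (int i = 0; i < 1000000; i++) {
            /* Pick some a in [1, 2); for roughly half of them the rounded
               product a*a exceeds the exact product, so the error term
               fma(a, a, -a*a) is negative. */
            double a = 1.0 + rand() / (double)RAND_MAX;
            double err = fma(a, a, -a * a);   /* rounding error of a*a */
            if (err < 0) {
                printf("a = %.17g\n", a);
                printf("a*a - a*a             = %g\n", a * a - a * a); /* 0   */
                printf("fma(a, a, -a*a)       = %g\n", err);           /* < 0 */
                printf("sqrt(fma(a, a, -a*a)) = %g\n", sqrt(err));     /* nan */
                return 0;
            }
        }
        printf("no example found in this sample\n");
        return 0;
    }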
a*b+c produces a result as if the computation were:

1. Calculate the infinitely precise product of a and b.
2. Round that product to the floating-point format.
3. Calculate the infinitely precise sum of that rounded product and c.
4. Round that sum to the floating-point format.

fma(a, b, c) produces a result as if the computation were:

1. Calculate the infinitely precise product of a and b.
2. Calculate the infinitely precise sum of that product and c.
3. Round that sum to the floating-point format.

So it skips the step of rounding the intermediate product to the floating-point format.
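A small sketch that makes the single rounding visible (assuming IEEE-754 double, and that the compiler does not contract a*a + c into an fma on its own; compiling with -ffp-contract=off avoids that):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* a is exactly representable; the exact product a*a = 1 + 2^-26 + 2^-54
           is not (the spacing of doubles near 1 is 2^-52), so a*a rounds to
           1 + 2^-26 and the 2^-54 term is lost. */
        double a = 1.0 + ldexp(1.0, -27);
        double c = -(1.0 + ldexp(1.0, -26));

        double naive = a * a + c;    /* product rounded first, then added */
        double fused = fma(a, a, c); /* one rounding, after the exact sum */

        printf("a*a + c      = %.17g\n", naive); /* 0 */
        printf("fma(a, a, c) = %.17g\n", fused); /* 5.5511151231257827e-17 == 2^-54 */
        return 0;
    }

The naive form loses the 2^-54 term when the intermediate product is rounded, so the subtraction cancels to 0; the fused form carries the exact product into the addition and rounds only once.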
On a processor with an FMA instruction, a fused multiply-add may be faster because it is one floating-point instruction instead of two, and hardware engineers can often design the processor to do it efficiently. On a processor without an FMA instruction, a fused multiply-add may be slower because the software has to use extra instructions to maintain the information necessary to get the required result.