In the BSD Library Functions Manual of FMA(3), it says "These functions compute x * y + z."
So what's the difference between FMA and naive code that does x * y + z? And why does FMA have better performance in most cases?
[ I don't have enough karma to make a comment; adding another answer seems to be the only possibility. ]
Eric's answer covers everything well, but a caveat: there are times when using fma(a, b, c) in place of a*b+c can cause difficult-to-diagnose problems.

Consider

    x = sqrt(a*a - b*b);

If it is replaced by

    x = sqrt(fma(a, a, -b*b));

there are values of a and b for which the argument to the sqrt function may be negative even if |a| >= |b|. In particular, this will occur if |a| = |b| and the infinitely precise product a*a is less than the rounded value of a*a. This follows from the fact that the rounding error in computing a*a is given by fma(a, a, -a*a).
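A minimal sketch of how this can bite in practice (assuming IEEE-754 double and the standard <math.h> fma): the loop below searches random values for an a whose rounding error fma(a, a, -a*a) is negative. For such an a, taking b equal to a, the "fused" form of sqrt(a*a - b*b) produces a NaN while the naive form returns 0.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        srand(1);
        for (int i = 0; i < 1000000; i++) {
            /* Pick some a in [1, 2); for roughly half of them the rounded
               product a*a exceeds the exact product, so the error term
               fma(a, a, -a*a) is negative. */
            double a = 1.0 + rand() / (double)RAND_MAX;
            double err = fma(a, a, -a * a);   /* rounding error of a*a */
            if (err < 0) {
                printf("a = %.17g\n", a);
                printf("a*a - a*a             = %g\n", a * a - a * a); /* 0   */
                printf("fma(a, a, -a*a)       = %g\n", err);           /* < 0 */
                printf("sqrt(fma(a, a, -a*a)) = %g\n", sqrt(err));     /* nan */
                return 0;
            }
        }
        printf("no example found in this sample\n");
        return 0;
    }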
a*b+c produces a result as if the computation were:

1. Calculate the infinitely precise product of a and b.
2. Round that product to the floating-point format.
3. Calculate the infinitely precise sum of that rounded product and c.
4. Round that sum to the floating-point format.

fma(a, b, c) produces a result as if the computation were:

1. Calculate the infinitely precise product of a and b.
2. Calculate the infinitely precise sum of that product and c.
3. Round that sum to the floating-point format.

So it skips the step of rounding the intermediate product to the floating-point format.
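A small sketch that makes the single rounding visible (assuming IEEE-754 double, and that the compiler does not contract a*a + c into an fma on its own; compiling with -ffp-contract=off avoids that):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* a is exactly representable; the exact product a*a = 1 + 2^-26 + 2^-54
           is not (the spacing of doubles near 1 is 2^-52), so a*a rounds to
           1 + 2^-26 and the 2^-54 term is lost. */
        double a = 1.0 + ldexp(1.0, -27);
        double c = -(1.0 + ldexp(1.0, -26));

        double naive = a * a + c;    /* product rounded first, then added */
        double fused = fma(a, a, c); /* one rounding, after the exact sum */

        printf("a*a + c      = %.17g\n", naive); /* 0 */
        printf("fma(a, a, c) = %.17g\n", fused); /* 5.5511151231257827e-17 == 2^-54 */
        return 0;
    }

The naive form loses the 2^-54 term when the intermediate product is rounded, so the subtraction cancels to 0; the fused form carries the exact product into the addition and rounds only once.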
On a processor with an FMA instruction, a fused multiply-add may be faster because it is one floating-point instruction instead of two, and hardware engineers can often design the processor to do it efficiently. On a processor without an FMA instruction, a fused multiply-add may be slower because the software has to use extra instructions to maintain the information necessary to get the required result.