1

In the BSD Library Functions Manual of FMA(3), it says "These functions compute x * y + z."

So what's the difference between FMA and naive code that does x * y + z? And why does FMA have better performance in most cases?

Patroclus
  • 1,163
  • 13
  • 31
  • When you refer to other materials in a question, do not say just “in FMA.” Provide a complete title and version number for a document and/or a link to it. That is necessary to give essential context to other people. We do not know what FMA description you are referring to. Something in an Intel architecture manual? Something in the C standard? Some amateur web site? – Eric Postpischil Aug 22 '19 at 02:30
  • @EricPostpischil Sure, I'll add it to the question – Patroclus Aug 22 '19 at 22:05

2 Answers

3

[ I don't have enough karma to make a comment; adding another answer seems to be the only possibility. ]

Eric's answer covers everything well, but a caveat: there are times when using fma(a, b, c) in place of a*b+c can cause difficult-to-diagnose problems.

Consider

x = sqrt(a*a - b*b);

If it is replaced by

x = sqrt(fma(a, a, -b*b));

there are values of a and b for which the argument to the sqrt function may be negative even if |a|>=|b|. In particular, this will occur if |a|=|b| and the infinitely precise product a*a is less than the rounded value of a*a. This follows from the fact that the rounding error in computing a*a is given by fma(a, a, -a*a).

JM Arnold
  • 31
  • 1
  • 1
  • 3
2

a*b+c produces a result as if the computation were:

  • Calculate the infinitely precise product of a and b.
  • Round that product to the floating-point format being used.
  • Calculate the infinitely precise sum of that result and c.
  • Round that sum to the floating-point format being used.

fma(a, b, c) produces a result as if the computation were:

  • Calculate the infinitely precise product of a and b.
  • Calculate the infinitely precise sum of that product and c.
  • Round that sum to the floating-point format being used.

So it skips the step of rounding the intermediate product to the floating-point format.

On a processor with an FMA instruction, a fused multiply-add may be faster because it is one floating-point instruction instead of two, and hardware engineers can often design the processor to do it efficiently. On a processor without an FMA instruction, a fused multiply-add may be slower because the software has to use extra instructions to maintain the information necessary to get the required result.

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312