Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

A fused multiply-add (also known as multiply-accumulate) performs a multiplication and a following addition or subtraction as a single operation, with only one rounding at the end.

For example:

x = a * b + c

would normally be computed with two roundings when fused multiply-add is not used: one after a * b, and one after adding c.

Fused multiply-add combines the two steps into a single operation with a single rounding, thereby increasing the accuracy of the computed result.
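As a concrete illustration, here is a minimal sketch using std::fma from <cmath>; the constants are chosen so that the low-order bit of the product is lost when the multiply rounds separately (compile with -ffp-contract=off so the compiler does not itself fuse the plain expression):

#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + std::ldexp(1.0, -27);    // 1 + 2^-27, exactly representable
    double c = -(1.0 + std::ldexp(1.0, -26)); // -(1 + 2^-26)
    // a*a = 1 + 2^-26 + 2^-54 exactly; the 2^-54 term does not fit in a
    // double, so a separate multiply rounds it away before the add:
    double separate = a * a + c;          // 0.0   (two roundings)
    double fused    = std::fma(a, a, c);  // 2^-54 (one rounding)
    std::printf("separate = %g, fused = %g\n", separate, fused);
}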

Supported architectures include:

  • PowerPC
  • Intel x86 (via the FMA3 instruction set)
  • AMD x86 (via FMA4, and FMA3 from Piledriver onward)
82 questions
3 votes · 2 answers

What do I need to do so GCC 4.9 recognizes the opportunity to use AVX FMA?

I have std::vector X, Y, both of size N (with N % 16 == 0), and I want to calculate sum(X[i] * Y[i]). That's a classical use case for fused multiply-add (FMA), which should be fast on AVX-capable processors. I know all my target CPUs are Intel,…
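A minimal sketch of the kind of loop in question (the function name is illustrative). With GCC, enabling FMA code generation (-mfma or -march=haswell) together with -ffp-contract=fast lets the compiler contract each multiply-add into a vfmadd instruction; vectorizing the sum itself additionally needs -fassociative-math (implied by -ffast-math), since it reorders a floating-point reduction:

#include <cstddef>
#include <vector>

// g++ -O2 -mfma -ffp-contract=fast -fassociative-math ...
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i] * y[i];  // candidate for vfmadd once contraction is on
    return sum;
}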
MSalters · 173,980
3 votes · 1 answer

Haswell FMA Instructions Generating Denormals

I am using the Intel Haswell CPU's FMA instructions to optimize some computations. However, I discovered that these instructions generate denormals even though I set the MXCSR register to DAZ and FTZ mode. How can I force those FMA instructions to…
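For reference, a minimal sketch of setting both modes with the SSE control-register intrinsics (FTZ flushes denormal results to zero, DAZ treats denormal inputs as zero, and each thread has its own MXCSR, so this must run on the thread doing the math):

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // FTZ: flush denormal results
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // DAZ: zero denormal inputs
}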
rohitsan · 1,001
3 votes · 0 answers

Should I use FMA explicitly in C++AMP for GPU kernels?

For example, I have an expression like a = b * c + d * e + f * g + h * i + j. Should I instead write a = fma(b, c, fma(d, e, fma(f, g, fma(h, i, j))))? Will the compiler automatically optimize the expression? Or is the fma form actually better than the…
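As a sketch of the trade-off (plain C++ with std::fma standing in for C++AMP's fma): the chained form rounds once per term, but it also serializes the additions into one dependency chain, whereas the naive form lets the independent multiplies execute in parallel:

#include <cmath>

double chained(double b, double c, double d, double e,
               double f, double g, double h, double i, double j) {
    // Each std::fma performs its multiply and add with a single rounding.
    return std::fma(b, c, std::fma(d, e, std::fma(f, g, std::fma(h, i, j))));
}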
BlueWanderer · 2,671
3 votes · 2 answers

Using FMA (fused multiply-add) instructions for complex multiplication

I'd like to leverage available fused multiply add/subtract CPU instructions to assist in complex multiplication over a decently sized array. Essentially, the basic math looks like so: void ComplexMultiplyAddToArray(float* pDstR, float* pDstI, const…
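A scalar sketch of one such multiply-accumulate, written so each line maps onto one fused operation (signature simplified from the one in the question; the math is (ar + i·ai)(br + i·bi) = (ar·br − ai·bi) + i(ar·bi + ai·br)):

#include <cmath>

void complex_mul_add(float& dstR, float& dstI,
                     float ar, float ai, float br, float bi) {
    dstR = std::fma(ar, br, std::fma(-ai, bi, dstR)); // dstR += ar*br - ai*bi
    dstI = std::fma(ar, bi, std::fma(ai, br, dstI));  // dstI += ar*bi + ai*br
}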
Kumputer · 588
3 votes · 1 answer

Accurate method to measure double-precision FMA and shared memory latency

I am trying to come up with an accurate way to measure the latency of two operations: the latency of a double-precision FMA operation, and the latency of a double-precision load from shared memory. I am using a K20x and was wondering if this code would give…
Christian Sarofeen · 2,202
2 votes · 1 answer

How should I implement a generic FMA/FMAF instruction in software?

FMA is a fused multiply-add instruction. The fmaf (float x, float y, float z) function in glibc calls the vfmadd213ss instruction. I want to know how this instruction is implemented. According to my understanding: add the exponents of x and y…
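For the float case specifically, a common software sketch goes through double, since the product of two floats is exact in double (24 + 24 = 48 significand bits fit in 53). Note the caveat in the comments; this is an approximation of what the question asks about, not a full bit-exact implementation:

#include <cmath>

float fmaf_sketch(float x, float y, float z) {
    double p = (double)x * (double)y;  // exact: no rounding happens here
    // The double addition rounds once, and converting back to float rounds
    // again. This "double rounding" can differ from a true fused result in
    // rare edge cases, which is why production implementations (and the
    // hardware) operate on the raw significands instead.
    return (float)(p + (double)z);
}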
xiaohuihui · 45
2 votes · 1 answer

CUDA half float operations without explicit intrinsics

I am using CUDA 11.2 and the __half type to do operations on 16-bit floating-point values. I am surprised that the nvcc compiler will not properly emit fused multiply-add instructions when I do: __half a, b, c; ... __half x = a * b +…
Bram · 7,440
2 votes · 2 answers

Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA for the Skylake architecture, the instruction…
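For orientation, the usual worked numbers for Skylake client cores: vfmadd*ps has a latency of 4 cycles and a throughput of 0.5 cycles per instruction, i.e. two FMA ports. Peak single-precision throughput is therefore 2 ports × 8 lanes per 256-bit vector × 2 flops (multiply + add) = 32 FLOPs per cycle per core, and sustaining it requires enough independent accumulators to cover the latency: 2 ports × 4 cycles = 8 FMAs in flight.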
no one special · 1,608
2 votes · 1 answer

Throughput of FMA and multiplication on x86 Broadwell

I suspect that recent Intel architectures (Broadwell in particular) execute the MUL mnemonic like an FMA with a zero addend. Specifically, I am currently computing products of quartic polynomials (Pi), following the pattern P1*P2*P3*P4…
Timocafé · 765
2 votes · 1 answer

FMA performance compared to naive calculation

I'm trying to compare FMA performance (fma() in math.h) against naive multiplication and addition in floating-point computations. The test is simple: I iterate the same calculation a large number of times. There are two things I have to achieve…
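One pitfall worth checking before benchmarking: if the compiler cannot emit the hardware instruction (e.g. GCC without -mfma), fma() becomes a correctly rounded library call that is far slower than a plain multiply-add. The standard macro below, defined by <cmath>/<math.h> when fma is fast, advertises which case you are in:

#include <cmath>
#include <cstdio>

int main() {
#ifdef FP_FAST_FMA
    std::puts("fma(double) is as fast as a separate multiply and add here");
#else
    std::puts("fma(double) is a library call here -- expect it to be slow");
#endif
}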
Jongbin Park · 659
2 votes · 1 answer

Is there a simple way to use multiply-accumulate in C++?

I've gotten a great performance benefit from using the mad function in the C++AMP library. I was wondering if there is a similar function for regular C++11? All I found by googling was material on AVX intrinsics, but I'd rather avoid them due to them not…
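A minimal sketch of the standard C++11 counterpart, std::fma from <cmath> (the wrapper name mad is just illustrative):

#include <cmath>

// With FMA code generation enabled (e.g. g++ -mfma, or -march=native on an
// FMA-capable CPU) this compiles to a single vfmadd instruction; without
// it, std::fma is a correctly rounded library call.
inline float mad(float a, float b, float c) {
    return std::fma(a, b, c);
}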
user81993 · 6,167
2 votes · 1 answer

Converting from floating-point to decimal with floating-point computations

I am trying to convert a floating-point double-precision value x to decimal with 12 (correctly rounded) significant digits. I am assuming that x is between 10^110 and 10^111 such that its decimal representation will be of the form x.xxxxxxxxxxxE110.…
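FMA is the standard tool for this kind of task because it exposes the exact rounding error of a product in one operation. A minimal sketch of that building block (often called TwoProduct), valid whenever a*b neither overflows nor underflows:

#include <cmath>

// Splits a*b into its rounded value p and the exact rounding error e,
// so that a*b == p + e holds exactly.
void two_product(double a, double b, double& p, double& e) {
    p = a * b;
    e = std::fma(a, b, -p);  // exact: the fma rounds only once
}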
Pascal Cuoq · 79,187
1 vote · 2 answers

Multiply-add `a = a*2 + b` instruction on CPU?

The classical multiply-accumulate operation is a = a + b*c. But I currently wonder whether there exists an instruction that allows doing the following operations on integers in one clock cycle (a and b are unsigned 64-bit integers: unsigned long long int): a…
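On x86-64 the specific form a*2 + b needs no multiply-accumulate instruction at all: scaled-index addressing lets a single LEA compute it, and compilers emit that on their own. A sketch (check the generated assembly with -O2):

// a*2 + b fits x86-64 addressing, e.g.: lea rax, [rsi + rdi*2]
unsigned long long mul2_add(unsigned long long a, unsigned long long b) {
    return a * 2 + b;
}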
Vincent · 57,703
1 vote · 0 answers

v4fmaddps instructions for packed 32-bit integers

For example, v4fmaddps is an instruction for packed single-precision (32-bit) floating-point elements, but I want to multiply-accumulate 32-bit integers. Can I use v4fmaddps with packed 32-bit integer inputs? Would this change the computation results?
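It would: a float FMA reinterprets the integer bit patterns as floats and produces meaningless results. A sketch of a 32-bit integer multiply-accumulate using the AVX2 integer intrinsics instead:

#include <immintrin.h>

// acc += a * b on eight 32-bit integer lanes (keeping the low 32 bits of
// each product), using integer instructions (vpmulld + vpaddd) rather
// than a floating-point FMA.
__m256i mac_epi32(__m256i acc, __m256i a, __m256i b) {
    return _mm256_add_epi32(acc, _mm256_mullo_epi32(a, b));
}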
anna · 39
1 vote · 0 answers

Deleting initialization leads to AVX2 FMA performance drop. Why?

I put a link here: https://godbolt.org/z/d6bx9vh1s. You can freely browse, edit, and check the speed. I wrote a piece of code to test AVX2 FMA's maximum speed. But I found that deleting the xor section leads to a huge performance drop (from 100+ GFLOPs…
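One plausible explanation (an assumption about this particular benchmark, not a confirmed diagnosis): the xor instructions zero the accumulator registers, which is also how compilers materialize _mm256_setzero_ps(). Left uninitialized, those registers can hold denormal bit patterns, and FMAs on denormal operands take microcode assists costing tens of cycles each:

#include <immintrin.h>

// Zeroing an accumulator explicitly; the compiler emits the same vxorps
// idiom that the "xor section" in the question performs by hand.
__m256 make_zero_acc() { return _mm256_setzero_ps(); }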
tigertang · 445