Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

A fused multiply-add (also known as multiply-accumulate) performs a multiplication and a following addition or subtraction as a single operation, with only one rounding at the end.

For example:

x = a * b + c

would normally be computed with two roundings when fused multiply-add is not used: one after a * b, and one after adding c.

Fused multiply-add combines the two steps into a single operation with a single rounding, thereby increasing the accuracy of the computed result.
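As a concrete illustration, here is a minimal sketch using std::fma from <cmath>; the constants are chosen so that the low-order bit of the product is lost when the multiply rounds separately (compile with -ffp-contract=off so the compiler does not itself fuse the plain expression):

#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + std::ldexp(1.0, -27);    // 1 + 2^-27, exactly representable
    double c = -(1.0 + std::ldexp(1.0, -26)); // -(1 + 2^-26)
    // a*a = 1 + 2^-26 + 2^-54 exactly; the 2^-54 term does not fit in a
    // double, so a separate multiply rounds it away before the add:
    double separate = a * a + c;          // 0.0   (two roundings)
    double fused    = std::fma(a, a, c);  // 2^-54 (one rounding)
    std::printf("separate = %g, fused = %g\n", separate, fused);
}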

Supported architectures include:

  • PowerPC
  • Intel x86 (via the FMA3 instruction set)
  • AMD x86 (via FMA4, and FMA3 from Piledriver onward)
82 questions
3 votes · 2 answers

What do I need to do so GCC 4.9 recognizes the opportunity to use AVX FMA?

I have std::vector X, Y, both of size N (with N % 16 == 0), and I want to calculate sum(X[i] * Y[i]). That's a classical use case for fused multiply-add (FMA), which should be fast on AVX-capable processors. I know all my target CPUs are Intel,…
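A minimal sketch of the kind of loop in question (the function name is illustrative). With GCC, enabling FMA code generation (-mfma or -march=haswell) together with -ffp-contract=fast lets the compiler contract each multiply-add into a vfmadd instruction; vectorizing the sum itself additionally needs -fassociative-math (implied by -ffast-math), since it reorders a floating-point reduction:

#include <cstddef>
#include <vector>

// g++ -O2 -mfma -ffp-contract=fast -fassociative-math ...
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i] * y[i];  // candidate for vfmadd once contraction is on
    return sum;
}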
MSalters · 173,980
3 votes · 1 answer

Haswell FMA Instructions Generating Denormals

I am using the Intel Haswell CPU's FMA instructions to optimize some computations. However, I discovered that these instructions generate denormals even though I set the MXCSR register to DAZ and FTZ mode. How can I force those FMA instructions to…
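For reference, a minimal sketch of setting both modes with the SSE control-register intrinsics (FTZ flushes denormal results to zero, DAZ treats denormal inputs as zero, and each thread has its own MXCSR, so this must run on the thread doing the math):

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // FTZ: flush denormal results
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // DAZ: zero denormal inputs
}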
rohitsan · 1,001
3 votes · 0 answers

Should I use FMA explicitly in C++AMP for GPU kernels?

For example, I have an expression like a = b * c + d * e + f * g + h * i + j. Should I instead write a = fma(b, c, fma(d, e, fma(f, g, fma(h, i, j))))? Will the compiler automatically optimize the expression? Or is the fma form actually better than the…
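As a sketch of the trade-off (plain C++ with std::fma standing in for C++AMP's fma): the chained form rounds once per term, but it also serializes the additions into one dependency chain, whereas the naive form lets the independent multiplies execute in parallel:

#include <cmath>

double chained(double b, double c, double d, double e,
               double f, double g, double h, double i, double j) {
    // Each std::fma performs its multiply and add with a single rounding.
    return std::fma(b, c, std::fma(d, e, std::fma(f, g, std::fma(h, i, j))));
}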
BlueWanderer · 2,671
3 votes · 2 answers

Using FMA (fused multiply-add) instructions for complex multiplication

I'd like to leverage available fused multiply add/subtract CPU instructions to assist in complex multiplication over a decently sized array. Essentially, the basic math looks like so: void ComplexMultiplyAddToArray(float* pDstR, float* pDstI, const…
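A scalar sketch of one such multiply-accumulate, written so each line maps onto one fused operation (signature simplified from the one in the question; the math is (ar + i·ai)(br + i·bi) = (ar·br − ai·bi) + i(ar·bi + ai·br)):

#include <cmath>

void complex_mul_add(float& dstR, float& dstI,
                     float ar, float ai, float br, float bi) {
    dstR = std::fma(ar, br, std::fma(-ai, bi, dstR)); // dstR += ar*br - ai*bi
    dstI = std::fma(ar, bi, std::fma(ai, br, dstI));  // dstI += ar*bi + ai*br
}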
Kumputer · 588
3 votes · 1 answer

Accurate method to measure double-precision FMA and shared memory latency

I am trying to come up with an accurate way to measure the latency of two operations: the latency of a double-precision FMA operation, and the latency of a double-precision load from shared memory. I am using a K20x and was wondering if this code would give…
Christian Sarofeen · 2,202
2 votes · 1 answer

How should I implement a generic FMA/FMAF instruction in software?

FMA is a fused multiply-add instruction. The fmaf (float x, float y, float z) function in glibc calls the vfmadd213ss instruction. I want to know how this instruction is implemented. According to my understanding: add the exponents of x and y…
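For the float case specifically, a common software sketch goes through double, since the product of two floats is exact in double (24 + 24 = 48 significand bits fit in 53). Note the caveat in the comments; this is an approximation of what the question asks about, not a full bit-exact implementation:

#include <cmath>

float fmaf_sketch(float x, float y, float z) {
    double p = (double)x * (double)y;  // exact: no rounding happens here
    // The double addition rounds once, and converting back to float rounds
    // again. This "double rounding" can differ from a true fused result in
    // rare edge cases, which is why production implementations (and the
    // hardware) operate on the raw significands instead.
    return (float)(p + (double)z);
}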
xiaohuihui · 45
2 votes · 1 answer

CUDA half float operations without explicit intrinsics

I am using CUDA 11.2 and the __half type to do operations on 16-bit floating-point values. I am surprised that the nvcc compiler will not properly emit fused multiply-add instructions when I do: __half a, b, c; ... __half x = a * b +…
Bram · 7,440
2 votes · 2 answers

Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA for the Skylake architecture, the instruction…
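For orientation, the usual worked numbers for Skylake client cores: vfmadd*ps has a latency of 4 cycles and a throughput of 0.5 cycles per instruction, i.e. two FMA ports. Peak single-precision throughput is therefore 2 ports × 8 lanes per 256-bit vector × 2 flops (multiply + add) = 32 FLOPs per cycle per core, and sustaining it requires enough independent accumulators to cover the latency: 2 ports × 4 cycles = 8 FMAs in flight.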
no one special · 1,608
2 votes · 1 answer

Throughput of FMA and multiplication on x86 Broadwell

I suspect that recent Intel architectures (Broadwell in particular) execute the MUL mnemonic like an FMA with a zero addend. Specifically, I am currently computing products of quartic polynomials (Pi), following the pattern P1*P2*P3*P4…
Timocafé · 765
2 votes · 1 answer

FMA performance compared to naive calculation

I'm trying to compare FMA performance (fma() in math.h) against naive multiplication and addition in floating-point computations. The test is simple: I iterate the same calculation a large number of times. There are two things I have to achieve…
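One pitfall worth checking before benchmarking: if the compiler cannot emit the hardware instruction (e.g. GCC without -mfma), fma() becomes a correctly rounded library call that is far slower than a plain multiply-add. The standard macro below, defined by <cmath>/<math.h> when fma is fast, advertises which case you are in:

#include <cmath>
#include <cstdio>

int main() {
#ifdef FP_FAST_FMA
    std::puts("fma(double) is as fast as a separate multiply and add here");
#else
    std::puts("fma(double) is a library call here -- expect it to be slow");
#endif
}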
Jongbin Park · 659
2 votes · 1 answer

Is there a simple way to use multiply-accumulate in C++?

I've gotten a great performance benefit from using the mad function in the C++AMP library. I was wondering if there is a similar function for regular C++11? All I found by googling was material on AVX intrinsics, but I'd rather avoid them due to them not…
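A minimal sketch of the standard C++11 counterpart, std::fma from <cmath> (the wrapper name mad is just illustrative):

#include <cmath>

// With FMA code generation enabled (e.g. g++ -mfma, or -march=native on an
// FMA-capable CPU) this compiles to a single vfmadd instruction; without
// it, std::fma is a correctly rounded library call.
inline float mad(float a, float b, float c) {
    return std::fma(a, b, c);
}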
user81993 · 6,167
2 votes · 1 answer

Converting from floating-point to decimal with floating-point computations

I am trying to convert a floating-point double-precision value x to decimal with 12 (correctly rounded) significant digits. I am assuming that x is between 10^110 and 10^111 such that its decimal representation will be of the form x.xxxxxxxxxxxE110.…
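FMA is the standard tool for this kind of task because it exposes the exact rounding error of a product in one operation. A minimal sketch of that building block (often called TwoProduct), valid whenever a*b neither overflows nor underflows:

#include <cmath>

// Splits a*b into its rounded value p and the exact rounding error e,
// so that a*b == p + e holds exactly.
void two_product(double a, double b, double& p, double& e) {
    p = a * b;
    e = std::fma(a, b, -p);  // exact: the fma rounds only once
}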
Pascal Cuoq · 79,187
1 vote · 2 answers

Multiply-add `a = a*2 + b` instruction on CPU?

The classical multiply-accumulate operation is a = a + b*c. But I currently wonder whether there exists an instruction that allows doing the following operations on integers in one clock cycle (a and b are unsigned 64-bit integers: unsigned long long int): a…
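On x86-64 the specific form a*2 + b needs no multiply-accumulate instruction at all: scaled-index addressing lets a single LEA compute it, and compilers emit that on their own. A sketch (check the generated assembly with -O2):

// a*2 + b fits x86-64 addressing, e.g.: lea rax, [rsi + rdi*2]
unsigned long long mul2_add(unsigned long long a, unsigned long long b) {
    return a * 2 + b;
}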
Vincent · 57,703
1 vote · 0 answers

v4fmaddps instructions for packed 32-bit integers

For example, v4fmaddps is an instruction for packed single-precision (32-bit) floating-point elements, but I want to multiply-accumulate 32-bit integers. Can I use v4fmaddps with packed 32-bit integer inputs? Would this change the computation results?
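It would: a float FMA reinterprets the integer bit patterns as floats and produces meaningless results. A sketch of a 32-bit integer multiply-accumulate using the AVX2 integer intrinsics instead:

#include <immintrin.h>

// acc += a * b on eight 32-bit integer lanes (keeping the low 32 bits of
// each product), using integer instructions (vpmulld + vpaddd) rather
// than a floating-point FMA.
__m256i mac_epi32(__m256i acc, __m256i a, __m256i b) {
    return _mm256_add_epi32(acc, _mm256_mullo_epi32(a, b));
}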
anna · 39
1 vote · 0 answers

Deleting initialization leads to AVX2 FMA performance drop. Why?

I put a link here: https://godbolt.org/z/d6bx9vh1s. You can freely browse, edit, and check the speed. I wrote a piece of code to test AVX2 FMA's maximum speed. But I found that deleting the xor section leads to a huge performance drop (from 100+ GFLOPs…
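One plausible explanation (an assumption about this particular benchmark, not a confirmed diagnosis): the xor instructions zero the accumulator registers, which is also how compilers materialize _mm256_setzero_ps(). Left uninitialized, those registers can hold denormal bit patterns, and FMAs on denormal operands take microcode assists costing tens of cycles each:

#include <immintrin.h>

// Zeroing an accumulator explicitly; the compiler emits the same vxorps
// idiom that the "xor section" in the question performs by hand.
__m256 make_zero_acc() { return _mm256_setzero_ps(); }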
tigertang · 445