Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

The fused multiply-add (also known as multiply-accumulate) operation computes a multiplication followed by an addition or subtraction as a single operation, with only one rounding applied at the end.

For example:

x = a * b + c

would normally be done with two roundings when fused multiply-add is not used: one after a * b and one after a * b + c.

Fused multiply-add combines the two operations into a single one, eliminating the intermediate rounding and thereby increasing the accuracy of the computed result.

Supported architectures include:

  • PowerPC
  • Intel x86 (via the FMA3 instruction set)
  • AMD x86 (via the FMA4 instruction set, and FMA3 from Piledriver onward)
82 questions
4
votes
1 answer

clang/gcc only generate fma with -ffast-math; why?

On icc 19, a dot product compiles down to a loop over an fma instruction. On clang and gcc, the fma is only generated with -ffast-math. However, -ffast-math breaks IEEE compliance, while fma is perfectly compliant with IEEE 754-2008, so if I have…
user14717
  • 4,757
  • 2
  • 44
  • 68
4
votes
1 answer

Weird optimization results for this multiply-add code

I'm compiling this code: #include template struct vec{ T v[4]; }; template vec foo (vec x, vec y, vec z) { return { x.v[0] + y.v[0] * z.v[0], x.v[1] + y.v[1] * z.v[1], …
einpoklum
  • 118,144
  • 57
  • 340
  • 684
4
votes
1 answer

Why is this code using VMULPD to write registers that will be overwritten by VFMADD? Isn't that useless?

While reviewing this piece of code, I noticed the following four instructions: vmulpd %ymm1,%ymm3,%ymm4 /* aim*bim */ vmulpd %ymm0,%ymm3,%ymm6 /* are*bim */ vfmadd231pd %ymm2,%ymm1,%ymm6 vfmsub231pd %ymm0,%ymm2,%ymm4 Now, if you consider that in…
Giulio Muscarello
  • 1,312
  • 2
  • 12
  • 33
4
votes
2 answers

FMA intrinsics not working: is it Hardware or Compiler?

I'm trying to use the Intel FMA intrinsics like _mm_fmadd_ps (__m128 a, __m128 b, __m128 c) in order to get better performance in my code. So, first of all, I did a little test program to see what it can do and how I can possibly use them. #include…
A.nechi
  • 521
  • 1
  • 5
  • 15
4
votes
3 answers

Generic way of handling fused-multiply-add floating-point inaccuracies

Yesterday I was tracking a bug in my project, which - after several hours - I've narrowed down to a piece of code which more or less was doing something like this: #include #include #include volatile float r =…
Freddie Chopin
  • 8,440
  • 2
  • 28
  • 58
4
votes
2 answers

Intel FMA Instructions Offer Zero Performance Advantage

Consider the following instruction sequence using Haswell's FMA instructions: __m256 r1 = _mm256_xor_ps (r1, r1); r1 = _mm256_fmadd_ps (rp1, m6, r1); r1 = _mm256_fmadd_ps (rp2, m7, r1); r1 = _mm256_fmadd_ps (rp3, m8, r1); __m256 r2 =…
rohitsan
  • 1,001
  • 8
  • 31
4
votes
1 answer

How to chain multiple fma operations together for performance?

Assuming that in some C or C++ code I have a function named T fma( T a, T b, T c ) that performs 1 multiplication and 1 addition like so: ( a * b ) + c; how am I supposed to optimize multiple mul & add steps? For example my algorithm needs to be…
user2485710
  • 9,451
  • 13
  • 58
  • 102
4
votes
2 answers

Z3: Floating point FMA semantics

Z3 returns a satisfying model for this benchmark: http://rise4fun.com/Z3/Bnv5m However, the query is essentially asserting that a*b+0 is equivalent to a*b using the FMA instruction, which I believe holds for IEEE floating point numbers. Note that…
alias
  • 28,120
  • 2
  • 23
  • 40
4
votes
3 answers

Where can I find a reference for the AMD FMA 4 intrinsics?

I am trying to modify a piece of code that uses SSE (128-bit) calls to use the 256-bit FMA feature on the Bulldozer Opteron. I can't seem to find the intrinsics for these calls. Some questions on this forum have used these intrinsics (ex: How to find…
powerrox
  • 1,334
  • 11
  • 21
3
votes
1 answer

Does VS2010 SP1 support only part of the AVX instruction set?

Microsoft states VS2010 supports the full set of AVX instructions: http://blogs.msdn.com/b/vcblog/archive/2009/11/02/visual-c-code-generation-in-visual-studio-2010.aspx ... In VS2010 release, all AVX features and instructions are fully supported via…
Mike
  • 1,717
  • 2
  • 15
  • 19
3
votes
0 answers

Fast fixed-size polynomial evaluation: MSVC vs GCC

I need to implement fast bivariate polynomial evaluation (for a polynomial whose size is fixed at compile time). I came up with the following example program: #include #include #include int main() { constexpr size_t…
pem
  • 365
  • 2
  • 12
3
votes
0 answers

Why is FMA code performing worse than AVX?

I am writing a basic linear algebra subprograms (BLAS) library. There is one issue with the performance of the fma code. using System; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86; namespace LinearAlgebra { public static class…
3
votes
1 answer

How to refine floating-point division on FMA-capable GPUs?

When writing computational code for GPUs using APIs where compute shaders are translated via SPIR-V (in particular, Vulkan), I am guaranteed that ULP error of floating-point division will be at most 3. Other basic arithmetic (addition,…
amonakov
  • 2,324
  • 11
  • 23
3
votes
3 answers

More aggressive optimization for FMA operations

I want to build a datatype that represents multiple (say N) arithmetic types and provides the same interface as an arithmetic type using operator overloading, such that I get a datatype like Agner Fog's vectorclass. Please look at this example:…
Nils
  • 31
  • 3
3
votes
1 answer

How advantageous is using fused multiply-accumulate for double-precision?

I am trying to understand whether it is advantageous to use std::fma with double arguments by looking at the assembly code that is generated. I am using the flag "-O3", and I am comparing the assembly for these two routines: #include #define…
user3116936
  • 492
  • 3
  • 21