Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

The fused multiply-add (also known as multiply-accumulate) operation computes a multiplication followed by an addition or subtraction as a single operation, with only one rounding applied at the end.

For example:

x = a * b + c

would normally be done with two roundings when fused multiply-add is not used: one after a * b and one after a * b + c.

Fused multiply-add combines the two operations into a single one, eliminating the intermediate rounding and thereby increasing the accuracy of the computed result.

Supported architectures include:

  • PowerPC
  • Intel x86 (via the FMA3 instruction set)
  • AMD x86 (via the FMA4 instruction set, and FMA3 from Piledriver onward)
82 questions
4
votes
1 answer

clang/gcc only generate fma with -ffast-math; why?

On icc 19, a dot product compiles down to a loop over an fma instruction. On clang and gcc, the fma is only generated with -ffast-math. However, -ffast-math breaks IEEE compliance, while fma is perfectly compliant with IEEE 754-2008, so if I have…
user14717
  • 4,757
  • 2
  • 44
  • 68
4
votes
1 answer

Weird optimization results for this multiply-add code

I'm compiling this code: #include template struct vec{ T v[4]; }; template vec foo (vec x, vec y, vec z) { return { x.v[0] + y.v[0] * z.v[0], x.v[1] + y.v[1] * z.v[1], …
einpoklum
  • 118,144
  • 57
  • 340
  • 684
4
votes
1 answer

Why is this code using VMULPD to write registers that will be overwritten by VFMADD? Isn't that useless?

While reviewing this piece of code, I noticed the following four instructions: vmulpd %ymm1,%ymm3,%ymm4 /* aim*bim */ vmulpd %ymm0,%ymm3,%ymm6 /* are*bim */ vfmadd231pd %ymm2,%ymm1,%ymm6 vfmsub231pd %ymm0,%ymm2,%ymm4 Now, if you consider that in…
Giulio Muscarello
  • 1,312
  • 2
  • 12
  • 33
4
votes
2 answers

FMA intrinsics not working: is it Hardware or Compiler?

I'm trying to use the Intel FMA intrinsics like _mm_fmadd_ps (__m128 a, __m128 b, __m128 c) in order to get better performance in my code. So, first of all, I did a little test program to see what it can do and how I can possibly use them. #include…
A.nechi
  • 521
  • 1
  • 5
  • 15
4
votes
3 answers

Generic way of handling fused-multiply-add floating-point inaccuracies

Yesterday I was tracking a bug in my project, which - after several hours - I've narrowed down to a piece of code which more or less was doing something like this: #include #include #include volatile float r =…
Freddie Chopin
  • 8,440
  • 2
  • 28
  • 58
4
votes
2 answers

Intel FMA Instructions Offer Zero Performance Advantage

Consider the following instruction sequence using Haswell's FMA instructions: __m256 r1 = _mm256_xor_ps (r1, r1); r1 = _mm256_fmadd_ps (rp1, m6, r1); r1 = _mm256_fmadd_ps (rp2, m7, r1); r1 = _mm256_fmadd_ps (rp3, m8, r1); __m256 r2 =…
rohitsan
  • 1,001
  • 8
  • 31
4
votes
1 answer

How to chain multiple fma operations together for performance?

Assuming that in some C or C++ code I have a function named T fma( T a, T b, T c ) that performs 1 multiplication and 1 addition like so: ( a * b ) + c; how am I supposed to optimize multiple mul & add steps? For example my algorithm needs to be…
user2485710
  • 9,451
  • 13
  • 58
  • 102
4
votes
2 answers

Z3: Floating point FMA semantics

Z3 returns a satisfying model for this benchmark: http://rise4fun.com/Z3/Bnv5m However, the query is essentially asserting that a*b+0 is equivalent to a*b using the FMA instruction, which I believe holds for IEEE floating point numbers. Note that…
alias
  • 28,120
  • 2
  • 23
  • 40
4
votes
3 answers

Where can I find a reference for the AMD FMA 4 intrinsics?

I am trying to modify a piece of code that uses SSE (128-bit) calls to use the 256-bit FMA feature on the Bulldozer Opteron. I can't seem to find the intrinsics for these calls. Some questions on this forum have used these intrinsics (ex: How to find…
powerrox
  • 1,334
  • 11
  • 21
3
votes
1 answer

Does VS2010 SP1 support only part of the AVX instruction set?

Microsoft states VS2010 supports the full set of AVX instructions: http://blogs.msdn.com/b/vcblog/archive/2009/11/02/visual-c-code-generation-in-visual-studio-2010.aspx ... In VS2010 release, all AVX features and instructions are fully supported via…
Mike
  • 1,717
  • 2
  • 15
  • 19
3
votes
0 answers

Fast fixed-size polynomial evaluation: MSVC vs GCC

I need to implement fast bivariate polynomial evaluation (for a polynomial whose size is fixed at compile time). I came up with the following example program: #include #include #include int main() { constexpr size_t…
pem
  • 365
  • 2
  • 12
3
votes
0 answers

Why is FMA code performing worse than AVX?

I am writing a basic linear algebra subprograms (BLAS) library. There is one issue with the performance of the fma code. using System; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86; namespace LinearAlgebra { public static class…
3
votes
1 answer

How to refine floating-point division on FMA-capable GPUs?

When writing computational code for GPUs using APIs where compute shaders are translated via SPIR-V (in particular, Vulkan), I am guaranteed that ULP error of floating-point division will be at most 3. Other basic arithmetic (addition,…
amonakov
  • 2,324
  • 11
  • 23
3
votes
3 answers

More aggressive optimization for FMA operations

I want to build a datatype that represents multiple (say N) arithmetic types and provides the same interface as an arithmetic type using operator overloading, such that I get a datatype like Agner Fog's vectorclass. Please look at this example:…
Nils
  • 31
  • 3
3
votes
1 answer

How advantageous is using fused multiply-accumulate for double-precision?

I am trying to understand whether it is advantageous to use std::fma with double arguments by looking at the assembly code that is generated. I am using the flag "-O3", and I am comparing the assembly for these two routines: #include #define…
user3116936
  • 492
  • 3
  • 21