FMA intrinsics not working: is it Hardware or Compiler?

Question

I'm trying to use the Intel FMA intrinsics like _mm_fmadd_ps (__m128 a, __m128 b, __m128 c) in order to get better performance in my code.

So, first of all, I wrote a little test program to see what they can do and how I could use them.

#include <stdio.h>
#include <stdlib.h>
#include "xmmintrin.h"

int main()
{
   __m128 v1,v2,v3,vr;
   v1 = _mm_set_ps (5.0, 5.0, 5.0, 5.0);
   v2 = _mm_set_ps (2.0, 2.0, 2.0, 2.0);
   v3 = _mm_set_ps (3.0, 3.0, 3.0, 3.0);

   vr = _mm_fmadd_ps (v1, v2, v3);
}

and I got this error:

error: incompatible types when assigning to type ‘__m128’ from type ‘int’
   vr = _mm_fmadd_ps (v1, v2, v3);

I thought the processor might not support such instructions, so I looked up my processor model (Intel® Core™ i7-4700MQ Processor) and found that its listed instruction set extensions are only SSE4.1/4.2 and AVX 2.0, which seemed a little weird to me! So I looked in /proc/cpuinfo and found the **fma** flag in the flags section. This is the confusing part about the hardware.

As for the software, I used these makefile options after some digging on the internet, and I hope they're not the issue.

CC=gcc
CFLAGS=-g -c -Wall -O2 -mavx2 -mfma

I'm using Eclipse on Ubuntu 12.04 LTS with GCC version 4.9.4. Thank you.

You need [`#include <immintrin.h>`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_fmadd_ps&expand=2389,2389). – Paul R, Jun 19 '17 at 12:27
That is a *compiler* error. The code hasn't even started running yet, so it cannot possibly be lack of support from your chip. — Cody Gray - on strike, Jun 19 '17 at 12:32
Note that this code does nothing useful, so when you compile it with optimizations enabled (`-O2`), the compiler elides all this code and simply emits code to return 0 from `main` ([demo](https://godbolt.org/g/YjDY6s)). So it'll run *real* fast. :-) — Cody Gray - on strike, Jun 19 '17 at 12:36
@CodyGray: true - making `vr` `volatile` fixes this though, if you just want to [see the generated code](https://godbolt.org/g/dm8Ptq). — Paul R, Jun 19 '17 at 13:05

Chuck Walbourn · Accepted Answer · 2022-09-13T16:31:18.733

4

One of the quirks of C is that when you call a function the compiler has not seen declared, it assumes the function returns int (an implicit declaration). Since you didn't include the header that actually declares _mm_fmadd_ps, you get the strange error about converting int to __m128.
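As a minimal sketch of the fix (assuming GCC or Clang on x86-64), including immintrin.h gives the compiler the real declaration, so the call type-checks. The helper name `fmadd4` is made up for illustration. The `target("fma")` attribute is a GCC/Clang extension used here only so the snippet also builds without -mfma; with the makefile flags from the question it should not be needed.

```c
#include <immintrin.h>  /* declares _mm_fmadd_ps; xmmintrin.h alone does not */

/* fmadd4 is a hypothetical helper, not part of any library.
   It computes out[i] = a[i] * b[i] + c[i] for four floats.
   The target attribute enables FMA code generation for this function only;
   the CPU must still support FMA when the function is actually called. */
__attribute__((target("fma")))
void fmadd4(const float *a, const float *b, const float *c, float *out)
{
    __m128 va = _mm_loadu_ps(a);           /* unaligned loads of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 vr = _mm_fmadd_ps(va, vb, vc);  /* a*b + c in each lane */
    _mm_storeu_ps(out, vr);
}
```

Called with the values from the question (all 5.0, all 2.0, all 3.0), every lane of `out` holds 13.0.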

The original organization of the intrinsics headers was to have one header per instruction-set generation, so you had:

mmintrin.h     The original MMX instruction set (deprecated for x64 native)
mm3dnow.h      The AMD 3D Now! instruction set (deprecated for x64 native)
xmmintrin.h    SSE (i.e. single-precision 4-wide SIMD)
emmintrin.h    SSE2 (i.e. double-precision 2-wide and integer SIMD)

After that, they started using the code names of the processor architecture where the new instructions were introduced.

pmmintrin.h    SSE3 (the p stands for Prescott)
tmmintrin.h    Supplemental SSE3 (the t stands for Tejas)
smmintrin.h    SSE4.1 (not sure what the s is here for.
               They were added for Penryn but p
               was already used for Prescott)
nmmintrin.h    SSE4.2 (the n stands for Nehalem)
wmmintrin.h    AES (the w stands for Westmere)

These days the new instruction sets tend to end up in either ammintrin.h for AMD-originated stuff (ABM, BMI, LWP, TBM, XOP, FMA4, SSE4a, SSE5) or immintrin.h for Intel-originated stuff (AVX, FMA3, F16C, AVX2, etc.). AVX-512 is in zmmintrin.h.

The older system wasn't particularly intuitive, but neither is the new one. A number of AMD instruction subsets are defined in immintrin.h because they are the same instruction. Looking it up in the documentation or the header is really the only way to know which intrinsic is where.

For Intel this website is a good reference. Otherwise you need to see the developer guides for AMD and/or Intel.

You might find this blog series of mine useful.

edited Sep 13 '22 at 16:31

answered Jun 19 '17 at 17:14

Chuck Walbourn

the quirk does not apply to C++ and C99+ because [there are no implicit int in those standards](https://stackoverflow.com/q/434763/995714) – phuclv Jun 19 '17 at 17:19
Why not just include `x86intrin.h` and set your build flags accordingly? Saves a lot of time in having to remember all these crazy, unintuitive names. – Cody Gray - on strike Jun 19 '17 at 17:37
Generally speaking, you shouldn't use any intrinsic unless you know exactly what platform it's supported on. Otherwise, you can easily end up writing a program that doesn't run on as many systems as you think it should. Just including everything is probably overkill and you *still* need to look up each instruction to really understand it. – Chuck Walbourn Jun 20 '17 at 05:14
That's true. The issue is, I remember the instruction mnemonics and which instruction set they are part of, just not these `?mmintrin.h` names. I guess I could look those up, too, but I've never found a good reason to do so. – Cody Gray - on strike Jun 20 '17 at 10:45
@ChuckWalbourn, `x86intrin.h` is for GCC and Clang where you define the hardware at compile time e.g. with `-msse4.2` so you only get the intrinsics that you choose at compile time. `x86intrin.h` is the logical choice for these compilers. For MSVC you don't generally define the hardware (except VEX encoding and a few others) so it will let you use whatever intrinsics you include. – Z boson Oct 17 '17 at 09:24
For the record, **`immintrin.h` is the portable catch-all for Intel *SIMD* intrinsics and is what most SIMD code should be using**, instead of futzing with other specific headers. (It's the only portable way to use AVX or later intrinsics.) GCC/clang `x86intrin.h` vs. MSVC `intrin.h` have some scalar intrinsics like for BSF/BSR bit-scan. – Peter Cordes Sep 13 '22 at 16:47
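To make the compile-time point in these comments concrete, here is a rough sketch assuming GCC or Clang, which predefine the `__FMA__` macro when FMA is enabled (for example by -mfma). The names `madd_ps` and `madd_lane0` are hypothetical. The single include of immintrin.h covers both code paths.

```c
#include <immintrin.h>  /* catch-all header: SSE through AVX2/FMA */

/* madd_ps is a hypothetical wrapper: it uses the fused instruction when the
   build enabled FMA, and falls back to separate multiply and add otherwise. */
static inline __m128 madd_ps(__m128 a, __m128 b, __m128 c)
{
#ifdef __FMA__
    return _mm_fmadd_ps(a, b, c);            /* one rounding step */
#else
    return _mm_add_ps(_mm_mul_ps(a, b), c);  /* two rounding steps */
#endif
}

/* Scalar convenience wrapper: broadcasts the inputs and returns lane 0. */
float madd_lane0(float a, float b, float c)
{
    __m128 r = madd_ps(_mm_set1_ps(a), _mm_set1_ps(b), _mm_set1_ps(c));
    return _mm_cvtss_f32(r);
}
```

The choice is fixed at build time, so a binary built with -mfma still needs an FMA-capable CPU to run. For inputs where every intermediate value is exactly representable, such as `madd_lane0(5.0f, 2.0f, 3.0f)`, both paths return 13.0.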

score 1 · Answer 2 · answered Jun 20 '17 at 09:31

The -mfma flag might seem like a bit of a bother, but it's there for good reason. The results of

_mm_add_ps(_mm_mul_ps(a, b), c)
_mm_fmadd_ps(a, b, c)

can actually differ: the first rounds twice (after the multiply and again after the add), while the fused version rounds only once. If you are writing code that must compute exactly the same results on every machine it runs on (determinism), then you will probably need to disable FMA! That's basically why you need to enable it in the build with -mfma.
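Here is a small sketch of the difference, assuming GCC or Clang on x86-64 and the default round-to-nearest mode. The function names are made up for illustration. The `volatile` on the product is only there to keep the rounded intermediate as a separate step so the compiler cannot fuse it, and the `target("fma")` attribute (a GCC/Clang extension) lets the fused version build without -mfma.

```c
#include <immintrin.h>

/* Hypothetical helper: fused multiply-add, rounded once at the end. */
__attribute__((target("fma")))
float fused_lane0(float a, float b, float c)
{
    __m128 r = _mm_fmadd_ps(_mm_set1_ps(a), _mm_set1_ps(b), _mm_set1_ps(c));
    return _mm_cvtss_f32(r);
}

/* Hypothetical helper: multiply, round, then add and round again. */
float unfused_lane0(float a, float b, float c)
{
    /* volatile forces the rounded product to be materialized,
       which prevents contraction into a single FMA instruction */
    volatile __m128 prod = _mm_mul_ps(_mm_set1_ps(a), _mm_set1_ps(b));
    __m128 r = _mm_add_ps(prod, _mm_set1_ps(c));
    return _mm_cvtss_f32(r);
}
```

Take a = b = 0x1.001p0f (that is, 1 + 2^-12) and c = -0x1.002p0f (that is, -(1 + 2^-11)). The exact product is 1 + 2^-11 + 2^-24, which does not fit in a float and rounds to 1 + 2^-11, so the unfused version returns 0. The fused version keeps the exact product until the add and returns 2^-24.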

Still, at least it's not as bad as the six compile flags you'll need for AVX-512-enabled Skylake-X CPUs :(

answered Jun 20 '17 at 09:31

robthebloke

FYI, for Visual C++ use of FMA3 is implied by ``/arch:AVX2``. The compiler will not always emit a true FMA3 instruction even when you use an FMA3 intrinsic, as the compiler determines proper codegen based on the use context. – Chuck Walbourn Jun 20 '17 at 18:08
What 6 compile flags for SKX? `-march=skylake-avx512` works just fine, and implies useful `-mtune=` settings. – Peter Cordes Sep 13 '22 at 16:46
I'm not sure this is true (anymore). Because at least gcc will replace `_mm_add_ps(_mm_mul_ps(a, b), c)` with `_mm_fmadd_ps(a, b, c)` on some platforms even without `-mfma` enabled. – Björn Lindqvist Oct 03 '22 at 11:14
