Accumulating vector in __m128 using _mm_hadd_ps producing compile time error

Question

I have recently been using C intrinsics, SIMD ones in particular, to make my code faster. My problem is the following: given a __m128 acc which holds 4 floats, I want to sum them into a single float ac. The following code

acc = _mm_hadd_ps(acc,acc);
acc = _mm_hadd_ps(acc,acc);
ac = _mm_cvtss_f32(acc);

does not compile, however, even though calling _mm_hadd_ps() twice matches my goal. The compiler produces the following output:

Compiler command failed with code 1
Command: gcc -pipe -std=c17 -g -gdwarf-4 -O3 -Wall -Wextra -Wpedantic -pedantic-errors -ffreestanding -nostdlib -static -march=k8 -mtune=generic -mno-80387 -mno-mmx -D_MM_MALLOC_H_INCLUDED -Wa,-march=k8+cmov+nommx -Wa,-mx86-used-note=no -Wa,--fatal-warnings -Wl,-n -Wl,--fatal-warnings -Wl,--no-dynamic-linker -Wl,--build-id=none -Wl,-z,defs -Wl,-z,noexecstack -Wl,-z,norelro -Wl,-z,noseparate-code -Wl,-Ttext=0x507340f44000 -Wl,-e,sdot -include defs.inc -o user.elf user.c
In file included from /usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/include/immintrin.h:33,
                 from user.c:2:
user.c: In function 'sdot':
/usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/include/pmmintrin.h:56:1: error: inlining failed in call to 'always_inline' '_mm_hadd_ps': target specific option mismatch
   56 | _mm_hadd_ps (__m128 __X, __m128 __Y)
      | ^~~~~~~~~~~
user.c:19:16: note: called from here
   19 |  __m128 acc3 = _mm_hadd_ps(acc2,acc2);
      |                ^~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/include/immintrin.h:33,
                 from user.c:2:
/usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/include/pmmintrin.h:56:1: error: inlining failed in call to 'always_inline' '_mm_hadd_ps': target specific option mismatch
   56 | _mm_hadd_ps (__m128 __X, __m128 __Y)
      | ^~~~~~~~~~~
user.c:17:18: note: called from here
   17 |    __m128 acc2 = _mm_hadd_ps(acc1,acc1);
      |                  ^~~~~~~~~~~~~~~~~~~~~~

What does it mean for a function to have to be inlined, and where in the code do I violate said restriction?

score 1 · Accepted Answer · answered May 27 '23 at 17:19

In your (excessively long for a Stack Overflow question) gcc command line, I see -march=k8. Looking at the relevant gcc documentation:

Processors based on the AMD K8 core with x86-64 instruction set support, including the AMD Opteron, Athlon 64, and Athlon 64 FX processors. (This supersets MMX, SSE, SSE2, 3DNow!, enhanced 3DNow! and 64-bit instruction set extensions.)

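However, the HADDPS instruction and its _mm_hadd_ps() intrinsic were introduced in SSE3, which plain k8 does not include. That is what "target specific option mismatch" means here: the intrinsic is an always_inline function compiled for SSE3, and gcc cannot inline it into a function built without SSE3. So you need to tell gcc to enable that instruction set too, with -msse3. (And, of course, the CPU you're running this on has to be recent enough to support SSE3; the earliest K8 processors don't, but later ones in the series do.) There is also a -march=k8-sse3 option that would work instead of plain k8.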
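As a minimal sketch of a third possibility, GCC and clang also accept a per-function target attribute, which enables SSE3 for one function without changing the command line. The function name hsum_hadd is hypothetical, and the same caveat applies: the CPU that runs it must support SSE3.

```c
#include <immintrin.h>

/* Hypothetical helper, not from the question. The target attribute enables
 * SSE3 for this function only, so _mm_hadd_ps can be inlined here even when
 * the file is built without -msse3 (for example with plain -march=k8). */
__attribute__((target("sse3")))
float hsum_hadd(__m128 acc)
{
    acc = _mm_hadd_ps(acc, acc);   /* lanes: a0+a1, a2+a3, a0+a1, a2+a3 */
    acc = _mm_hadd_ps(acc, acc);   /* every lane: a0+a1+a2+a3 */
    return _mm_cvtss_f32(acc);     /* extract the low lane */
}
```

Calling hsum_hadd(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)) should give 10.0f. With -msse3 or -march=k8-sse3 on the command line, the attribute is unnecessary.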

For the problem you're trying to solve of summing up an SSE vector of floats, though, there are better approaches than two hadds. See Peter Cordes's excellent write-up of different ways of doing it.
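One such approach, sketched here on the assumption that SSE3 is available, replaces the two hadds with a movshdup shuffle and ordinary adds. The function name hsum_ps_sse3 is illustrative, and the target attribute is only there so the sketch compiles without -msse3.

```c
#include <immintrin.h>

/* Illustrative horizontal sum using SSE3 movshdup instead of two hadds.
 * Input lanes are written (v0, v1, v2, v3), lowest lane first. */
__attribute__((target("sse3")))
float hsum_ps_sse3(__m128 v)
{
    __m128 shuf = _mm_movehdup_ps(v);    /* (v1, v1, v3, v3) */
    __m128 sums = _mm_add_ps(v, shuf);   /* (v0+v1, 2*v1, v2+v3, 2*v3) */
    shuf = _mm_movehl_ps(shuf, sums);    /* low lane becomes v2+v3 */
    sums = _mm_add_ss(sums, shuf);       /* low lane: v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}
```

Only the low lane of the result is meaningful; the upper lanes hold leftover partial sums.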

Thought about flagging it as a dupe of the SSE hsum question linked to in the answer, but the underlying issue is the more general "using an intrinsic not in the currently enabled ones"... — Shawn, May 27 '23 at 17:24
Yeah, this isn't a duplicate of the efficient hsum question. It is basically a duplicate of some existing Q&As about needing to enable ISA options to compile intrinsics with GCC or clang, though. (The efficient-hsum Q&A is kind of a bonus I might add into the duplicate list, though. The most efficient hsum without AVX does require SSE3 for `movshdup`, costing one less `movaps` than the next best option which only requires SSE1.) — Peter Cordes, May 27 '23 at 18:29
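For a target without SSE3, such as plain -march=k8, the SSE1-only option mentioned in the comment above could look roughly like this. It is a sketch with an illustrative function name; the only difference from the movshdup version is that the first shuffle uses _mm_shuffle_ps.

```c
#include <xmmintrin.h>

/* Illustrative SSE1-only horizontal sum; needs no -msse3.
 * Input lanes are written (v0, v1, v2, v3), lowest lane first. */
float hsum_ps_sse1(__m128 v)
{
    /* Swap neighbours within each pair: (v1, v0, v3, v2) */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(v, shuf);   /* (v0+v1, v0+v1, v2+v3, v2+v3) */
    shuf = _mm_movehl_ps(shuf, sums);    /* low lane becomes v2+v3 */
    sums = _mm_add_ss(sums, shuf);       /* low lane: v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}
```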
