
I'm trying to test some of the Intel intrinsics to see how they work. So I created a function to do that for me, and this is the code:

void test_intel_256()
{
__m256 res,vec1,vec2;

__M256_MM_SET_PS(vec1, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0);
__M256_MM_SET_PS(vec2, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0);

__M256_MM_ADD_PS(res,vec1,vec2);

if (res[0] ==9 && res[1] ==9 && res[2] ==9 && res[3] ==9 
  && res[4] ==9 && res[5] ==9 && res[6] ==9 && res[7] ==9 )
    printf("Addition : OK!\n");
else
    printf("Addition : FAILED!\n");
}

But then I'm getting these errors:

error: unknown type name ‘__m256’
error: subscripted value is neither array nor pointer nor vector
error: subscripted value is neither array nor pointer nor vector 
error: subscripted value is neither array nor pointer nor vector
error: subscripted value is neither array nor pointer nor vector
error: subscripted value is neither array nor pointer nor vector
error: subscripted value is neither array nor pointer nor vector
error: subscripted value is neither array nor pointer nor vector
error: subscripted value is neither array nor pointer nor vector
error: subscripted value is neither array nor pointer nor vector

This means the compiler doesn't recognize the __m256 type, and as a consequence it can't treat res as an array of floats. I'm including these libraries mmintrin.h, emmintrin.h, xmmintrin.h and I'm using Eclipse Mars.

So what I want to know is whether the problem comes from the compiler, the hardware, or something else, and how I can solve it. Thank you!

Peter Cordes
A.nechi
  • Are you sure your CPU supports AVX? Which CPU are you using? – Daniel Margosian Jul 29 '16 at 15:27
  • @DanielMargosian: Even if their CPU doesn't support AVX, the compiler should still be able to compile it. (Cross compilation exists). – zindorsky Jul 29 '16 at 15:33
  • My CPU is **Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8** and it supports **SSE4.1/4.2, AVX 2.0** – A.nechi Jul 29 '16 at 15:38
  • @A.nechi Ok, what command are you using to compile? – Daniel Margosian Jul 29 '16 at 15:40
  • Are you using gcc? I have to specify -mavx2 on the command line to "enable" this (and include immintrin.h) – jcoder Jul 29 '16 at 15:42
  • Well, I'm using the Eclipse default configuration for compiling (gcc -O0 -g3 -Wall -c), and it worked for both __m64 and __m128 – A.nechi Jul 29 '16 at 15:42
  • How could it be a hardware problem when your code hasn't even compiled yet? You don't need AVX hardware to compile code targeting it. – Peter Cordes Jul 29 '16 at 16:05
  • Possible duplicate of [C++ Intrinsic not declared](http://stackoverflow.com/questions/17549630/c-intrinsic-not-declared). The answer to that question covers both parts of this: headers and `-mavx` – Peter Cordes Jul 29 '16 at 16:09

2 Answers


MMX and SSE2 are baseline for x86-64, but AVX is not. You do need to specifically enable AVX, where you didn't for SSE2.

Build with -march=haswell or whatever CPU you actually have. Or just use -mavx.
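A quick way to confirm that the flag actually reached the compiler is to check the predefined `__AVX__` macro, which GCC and clang define when AVX code generation is enabled. This is a minimal sketch; the function name `avx_enabled_at_compile_time` is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if this file was compiled with AVX enabled
   (for example with -mavx, -mavx2 or -march=haswell), 0 otherwise. */
int avx_enabled_at_compile_time(void)
{
#ifdef __AVX__
    return 1;
#else
    return 0;
#endif
}

/* Prints which case applies, so a missing flag in the IDE's build
   settings shows up immediately. */
void report_avx_build(void)
{
    if (avx_enabled_at_compile_time())
        puts("AVX was enabled for this build");
    else
        puts("AVX was NOT enabled for this build (try -mavx)");
}
```

If `report_avx_build()` prints the second message from an Eclipse build, the flag is probably missing from the project's compiler settings.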

Beware that gcc -mavx with the default tune=generic will split 256b loadu/storeu intrinsics into vmovups xmm / vinsertf128, which is bad if your data is actually aligned most of the time, and especially bad on Haswell with limited shuffle-port throughput.

It's good for Sandybridge and Bulldozer-family if your data really is unaligned, though. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568: it even affects AVX2 vector-integer code, even though all AVX2 CPUs (except maybe Excavator and Ryzen) are harmed by this tuning. tune=generic doesn't take into account which instruction-set extensions are enabled, and there's no tune=generic-avx2.

You could use -mavx2 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store. That still doesn't enable other tuning options (like optimizing for macro-fusion of compare and branch) that all modern x86 CPUs have (except low-power ones), but that isn't enabled by gcc's tune=generic. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78855).


Also:

I'm including these libraries mmintrin.h, emmintrin.h, xmmintrin.h

Don't do that. Always just include immintrin.h in SIMD code. It pulls in all Intel SSE/AVX extensions. This is why you get error: unknown type name ‘__m256’.


Keep in mind that subscripting vector types like __m256 is non-standard and non-portable. They're not arrays, and there's no reason you should expect [] to work like an array. Extracting the 3rd element or something from a SIMD vector in a register requires a shuffle instruction, not a load.
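As a rough sketch of a more portable way to check the result, the vector can be stored to a plain float array and the array compared instead of subscripting `__m256`. This is not the asker's macro-based code: it calls `_mm256_set1_ps`, `_mm256_add_ps` and `_mm256_storeu_ps` directly. It also assumes GCC or clang, because it uses `__attribute__((target("avx")))` and `__builtin_cpu_supports` so that the file compiles even without `-mavx` and skips the test on a CPU without AVX. The helper name `avx_add_matches` and the `int` return value are illustrative choices.

```c
#include <immintrin.h>  /* one header for all Intel SSE/AVX intrinsics */
#include <stdio.h>

/* Assumption: GCC or clang. The target attribute enables AVX for this one
   function, so the file also compiles without -mavx on the command line. */
__attribute__((target("avx")))
static int avx_add_matches(float a, float b, float expected)
{
    __m256 vec1 = _mm256_set1_ps(a);          /* broadcast a to all 8 lanes */
    __m256 vec2 = _mm256_set1_ps(b);          /* broadcast b to all 8 lanes */
    __m256 res  = _mm256_add_ps(vec1, vec2);  /* lane-wise addition */

    float out[8];
    _mm256_storeu_ps(out, res);               /* store, then read as floats */

    for (int i = 0; i < 8; i++)
        if (out[i] != expected)
            return 0;
    return 1;
}

/* Returns 1 if the addition is correct, 0 if not, -1 if the CPU lacks AVX. */
int test_intel_256(void)
{
    if (!__builtin_cpu_supports("avx")) {
        puts("Addition : skipped (no AVX on this CPU)");
        return -1;
    }
    int ok = avx_add_matches(7.0f, 2.0f, 9.0f);
    puts(ok ? "Addition : OK!" : "Addition : FAILED!");
    return ok;
}
```

With `-mavx` (or `-march=haswell`) on the command line, the target attribute and the runtime check are unnecessary and can be dropped.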


If you want handy wrappers for vector types that let you do stuff like use operator[] to extract scalars from elements of vector variables, have a look at Agner Fog's Vector Class Library. It's GPLed, so you'll have to look at other wrapper libraries if that's a problem.

It lets you do stuff like

// example from the manual for operator[]
Vec4i a(10,11,12,13);
int b = a[2];   // b = 12

You can use normal intrinsics on VCL types. Vec8f is a transparent wrapper on __m256, so you can use it with _mm256_mul_ps.

Peter Cordes
  • "Keep in mind that subscripting vector types lie __m256 is non-standard and non-portable." In GCC at least they literally are arrays, they just have a special annotation to add some operations. documentation here: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html – Joseph Garvin May 14 '19 at 18:29
  • @JosephGarvin: Correct, it's portable and safe across gcc and clang, and maybe also ICC, assuming they continue to define `__m256` in terms of GNU C native vectors the same way. But it's *not* portable to MSVC, which defines Intel's intrinsic types as a union of `float m128_f32[4];` and some other members. (See [Is accessing bytes of a \_\_m128 variable via union legal?](//stackoverflow.com/q/15045132).) Intel's intrinsics API doesn't define that, unfortunately, so portable element access needs store/reload or a wrapper library. – Peter Cordes May 15 '19 at 07:01

Try this:

res = _mm256_add_ps(vec1, vec2); because the prototype of the underlying intrinsic is

__m256 _mm256_add_ps(__m256, __m256);

It takes two __m256 values as parameters and returns their sum as an __m256, just like

int add(int , int);

For initializing:

vec = _mm256_setr_ps(7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0) or

vec = _mm256_load_ps(arr) or

vec = _mm256_load_ps(ptr)
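To illustrate initializing from memory, here is a small sketch that loads two float arrays, adds them, and stores the result. It uses `_mm256_loadu_ps` rather than `_mm256_load_ps`, because the aligned form requires a 32-byte-aligned pointer and an ordinary local array is not guaranteed to have that alignment. The names `add8` and `add8_demo` are made up for this example, and the target attribute and `__builtin_cpu_supports` are GCC/clang-specific assumptions.

```c
#include <immintrin.h>

/* Assumption: GCC or clang; the attribute enables AVX for this function only. */
__attribute__((target("avx")))
static void add8(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);   /* unaligned load of 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}

/* Returns 1 if every lane equals a[i] + b[i], 0 otherwise,
   and -1 if the CPU does not support AVX. */
int add8_demo(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float out[8];

    if (!__builtin_cpu_supports("avx"))
        return -1;

    add8(a, b, out);
    for (int i = 0; i < 8; i++)
        if (out[i] != a[i] + b[i])
            return 0;
    return 1;
}
```

If the arrays are declared with 32-byte alignment (for example via C11 `_Alignas(32)`), `_mm256_load_ps` can be used instead.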

sekhar