Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding (VEX) for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating-point vector instructions have 256-bit versions in AVX, but 256-bit integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance problem. Several performance questions under this tag turned out to have exactly this as the answer.
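As a minimal sketch of where VZEROUPPER usually goes, the function below does some 256-bit work and then calls `_mm256_zeroupper()` before returning to code that might use legacy SSE encodings. The helper name `scale_avx` is hypothetical, and `__attribute__((target("avx")))` assumes GCC or Clang; compilers typically insert the instruction themselves when building with AVX enabled, so the explicit call mostly matters for hand-written assembly or unusual call boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: multiply n floats in place by k using 256-bit AVX.
   Assumes n is a multiple of 8; unaligned loads/stores are used so the
   caller does not need 32-byte alignment. */
__attribute__((target("avx")))
void scale_avx(float *data, size_t n, float k)
{
    assert(data != NULL && n % 8 == 0);

    const __m256 vk = _mm256_set1_ps(k);   /* broadcast k to all 8 elements */
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, vk));
    }

    /* Emit VZEROUPPER: zero the upper 128 bits of every ymm register so
       that later legacy-SSE code does not hit the transition penalty. */
    _mm256_zeroupper();
}
```

Calling `scale_avx(buf, 16, 2.0f)` on a 16-element buffer doubles every element; the caller never sees a ymm register with a dirty upper half.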

Another pitfall for beginners is that most 256-bit instructions operate on two independent 128-bit lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
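The in-lane behaviour is easiest to see with a small experiment. The sketch below wraps `_mm256_unpacklo_ps` (the intrinsic for VUNPCKLPS) in a hypothetical helper, `unpacklo8`, that works on plain float arrays; `__attribute__((target("avx")))` again assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: interleave elements of a and b with
   _mm256_unpacklo_ps and store the 8 results in out.
   All three pointers must refer to at least 8 floats. */
__attribute__((target("avx")))
void unpacklo8(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);

    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);

    /* Each 128-bit lane is processed separately:
       low lane  -> a0 b0 a1 b1
       high lane -> a4 b4 a5 b5 */
    _mm256_storeu_ps(out, _mm256_unpacklo_ps(va, vb));
}
```

With `a = {0, 1, 2, 3, 4, 5, 6, 7}` and `b = {10, 11, 12, 13, 14, 15, 16, 17}`, the output is `{0, 10, 1, 11, 4, 14, 5, 15}`, not the `{0, 10, 1, 11, 2, 12, 3, 13}` one would get if the register were treated as a single 8-element vector.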

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.


Interesting Q&As / FAQs:

1252 questions
0 votes, 1 answer

Why is this AVX code slower?

Updated: 19 Aug. 2017, 16:49 UTC I’m writing AVX code to multiply a vector with 4 billion components by a constant; however, I see no difference between my small (and, I hope, optimized) AVX code and the long, scalar, compiler-optimized version. Both…
0 votes, 0 answers

Error in initialising object with SIMD member using new keyword

I was trying to initialise a C++ object with an SIMD member with a new keyword. Here is my code: #include class simd_obj { protected: // float a; __m256 a; public: simd_obj(float f) { // a = f; a =…
Firman
0 votes, 2 answers

How to pack 16 16-bit registers/variables on AVX registers

I use inline assembly; my code looks like this: __m128i inl = _mm256_castsi256_si128(in); __m128i inh = _mm256_extractf128_si256(in, 1); __m128i outl, outh; __asm__( "vmovq %2, %%rax \n\t" "movzwl %%ax, %%ecx …
Bai
0 votes, 1 answer

Does vmovd have avx-sse transition penalty?

(Assuming there are many AVX instructions before and after movd.) If I use vmovd to move data between general-purpose registers and ymm registers, does it get slower because only one float value of the ymm register is used?
huseyin tugrul buyukisik
0 votes, 0 answers

Will Fortran code with LSODA as the main part benefit from AVX and AVX512

Sorry if the question is naive or obvious: I am the user, not developer. The question is whether LSODA as implemented in ODEPACK (FORTRAN code) takes advantage of AVX option of Xeon processors, and how much performance improvement relative to no-AVX…
0 votes, 1 answer

g++ 6.3, Kahan summation on avx intrinsics get serialized with volatile keyword

Using AVX intrinsics and the Kahan summation algorithm, I've tried this (just a part of "adder"): void add(const __m256 valuesToAdd) { volatile __m256 y = _mm256_sub_ps(valuesToAdd, accumulatedError); volatile __m256 t =…
huseyin tugrul buyukisik
0 votes, 1 answer

How to align __m256d inside a struct?

Consider the following code: // Thin/POD struct struct Data { __m256d a; __m256d b; }; // Thick base class class Base { // ... }; // Thick derived class class Derived : public Base { Data data; // ... }; Is there a way to ensure that…
Serge Rogatch
0 votes, 1 answer

How to detect a Xeon Phi (Knights Landing)

Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid costly transition to non-VEX state on all processors, including future Xeon processor, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023 People have also measured…
Maxim Masiutin
0 votes, 0 answers

Intel modular arithmetic using AVX or SSE

Is it possible to do modular arithmetic on integers with AVX or SSE? i.e. perform several mods all at once. Bipman
Bipman
0 votes, 1 answer

Vector Scalar multiplication AVX segmentation fault on Mac OSX

Hi I am trying to write a code for Vector-Scalar multiplication using AVX on Sandy Bridge processor i7-3720QM (~2012). The code is a C code compiled with GNU gcc on Mac OSX 10.8. gcc -mavx -Wa,-q -o bb5 code1.c -lm I am getting Segmentation fault:…
Guddu
0 votes, 0 answers

AVX support for remainder in G++ 5.4.0

I am writing a program using AVX with G++ 5.4.0 on Ubuntu 16.04. The Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) says I can use _mm256_irem_epi32 in immintrin.h to compute the element-wise remainder of two…
Harper
0 votes, 1 answer

Matrix multiplication code running slower with AVX2

I am learning to program with AVX, so I wrote a simple program to multiply matrices of size 4. With no compiler optimizations, the AVX version is slightly faster than the non-AVX version, but with -O3 optimization, the non-AVX version becomes…
pythonic
0 votes, 1 answer

AVX2 SIMD addition not working

I am trying to add these two vectors using AVX2 SIMD instructions. The code compiles with no errors or warnings, but crashes when run. Why? It should print the result of the SIMD addition with AVX2 no matter how large the array is which is initialized in…
K.Malu
0 votes, 0 answers

AVX2 Matrix 4x4 multiplication not working

This is a 4x4 matrix multiplication program using AVX2, but the program is not displaying the output. Please see where the problem is, and do I have to do anything for memory alignment or not? Please suggest. #include #include…
K.Malu
0 votes, 1 answer

data alignment in structure and avx optimization

I'm trying to figure out what the best (maybe AVX?) optimization is for this code: typedef struct { float x; float y; } vector; vector add(vector u, vector v){ return (vector){u.x+v.x, u.y+v.y}; } Running gcc -S code.c gives a quite long…
Fabio