Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance penalty. Several performance questions under this tag turned out to have exactly this answer.

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
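
To make the in-lane behaviour concrete, here is a minimal scalar sketch in C of the data movement performed by the 256-bit form of UNPCKLPS (VUNPCKLPS, the instruction behind _mm256_unpacklo_ps). It is only a model for reasoning about element positions, not real SIMD code, and the function name unpacklo_ps_256_model is made up for this illustration.

```c
#include <stddef.h>

/* Scalar model (hypothetical helper, not an intrinsic) of where the
   256-bit VUNPCKLPS puts each float. A ymm register holds 8 floats,
   split into two 128-bit lanes of 4 floats each. */
void unpacklo_ps_256_model(const float a[8], const float b[8], float out[8])
{
    for (size_t lane = 0; lane < 2; ++lane) {
        size_t base = lane * 4;       /* first element of this lane */
        /* Interleave the LOW two elements of this lane only. */
        out[base + 0] = a[base + 0];
        out[base + 1] = b[base + 0];
        out[base + 2] = a[base + 1];
        out[base + 3] = b[base + 1];
    }
}
```

With a = {a0..a7} and b = {b0..b7}, the model yields {a0, b0, a1, b1, a4, b4, a5, b5}. A reader who treats the register as one long vector would expect {a0, b0, a1, b1, a2, b2, a3, b3}, which is not what the instruction produces.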

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely the data is not aligned where alignment is required. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc, and use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work, including in-lane vs. lane-crossing shuffles for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there's a way to detect OS support in an OS-independent way.
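
As a rough illustration of the alignment requirement mentioned in the FAQs above, the following C sketch allocates a float buffer on a 32-byte boundary with C11 aligned_alloc and checks a pointer's alignment. The helper names alloc_floats_32 and is_aligned_32 are invented for this example, the sketch assumes a C11 library that provides aligned_alloc (MSVC does not; _aligned_malloc is the counterpart there), and it does not guard against overflow in the size computation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `count` floats on a 32-byte boundary,
   suitable for aligned 256-bit loads and stores. Release with free(). */
float *alloc_floats_32(size_t count)
{
    size_t bytes = count * sizeof(float);
    /* C11 aligned_alloc expects the size to be a multiple of the
       alignment, so round up to the next multiple of 32. */
    size_t padded = (bytes + 31) & ~(size_t)31;
    if (padded == 0)
        padded = 32;
    return aligned_alloc(32, padded);
}

/* Hypothetical helper: returns 1 if p lies on a 32-byte boundary. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

A pointer that fails is_aligned_32 (for example, an offset into the middle of an array) needs the unaligned loadu / storeu intrinsics instead of the aligned load / store forms.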

1252 questions

1 answer

Why is this AVX code slower?

Updated: 19 Aug. 2017, 16:49 UTC I’m writing an AVX code to multiply a vector with 4 billion components by a constant, however, I see no difference between my small -- I hope -- optimized AVX code and the long scalar compiler optimized version. Both…

asked Aug 19 '17 at 14:20

Amanda Osvaldo

0 answers

Error in initialising object with SIMD member using new keyword

I was trying to initialise a C++ object with an SIMD member with a new keyword. Here is my code: #include class simd_obj { protected: // float a; __m256 a; public: simd_obj(float f) { // a = f; a =…

c++ simd avx

asked Aug 16 '17 at 15:07

Firman

2 answers

How to pack 16 16-bit registers/variables on AVX registers

I use inline assemble, my code like this: __m128i inl = _mm256_castsi256_si128(in); __m128i inh = _mm256_extractf128_si256(in, 1); __m128i outl, outh; __asm__( "vmovq %2, %%rax \n\t" "movzwl %%ax, %%ecx …

assembly x86 sse avx

asked Aug 11 '17 at 03:33

Bai

1 answer

Does vmovd have avx-sse transition penalty?

(Assuming there are many avx instructions before and after movd)If I use vmovd to move data between general purpose registers and ymm registers, does it get slower because of using only 1 float value of ymm?

assembly avx

asked Aug 06 '17 at 13:42

huseyin tugrul buyukisik

0 answers

Will Fortran code with LSODA as the main part benefit from AVX and AV512

Sorry if the question is naive or obvious: I am the user, not developer. The question is whether LSODA as implemented in ODEPACK (FORTRAN code) takes advantage of AVX option of Xeon processors, and how much performance improvement relative to no-AVX…

fortran simd avx

asked Aug 06 '17 at 04:17

user8423358

1 answer

g++ 6.3, Kahan summation on avx intrinsics get serialized with volatile keyword

Using avx intrinsics and Kahan summation algorithm, I've tried this(just a part of "adder"): void add(const __m256 valuesToAdd) { volatile __m256 y = _mm256_sub_ps(valuesToAdd, accumulatedError); volatile __m256 t =…

c++ g++ volatile intrinsics avx

asked Aug 05 '17 at 23:20

huseyin tugrul buyukisik

1 answer

How to align __m256d inside a struct?

Consider the following code: // Thin/POD struct struct Data { __m256d a; __m256d b; }; // Thick base class class Base { // ... }; // Thick derived class class Derived : public Base { Data data; // ... }; Is there a way to ensure that…

c++ struct alignment sse avx

asked Jul 02 '17 at 05:47

Serge Rogatch

1 answer

How to detect a Xeon Phi (Knights Landing)

Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid costly transition to non-VEX state on all processors, including future Xeon processor, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023 People have also measured…

avx avx2 xeon-phi avx512 knights-landing

asked Jun 09 '17 at 20:12

Maxim Masiutin

0 answers

Intel modular arithmetic using AVX or SSE

Is it possible to do modular arithmetic on integers with AVX or SSE? i.e. perform several mods all at once. Bipman

sse avx modular-arithmetic

asked Apr 28 '17 at 15:35

Bipman

1 answer

Vector Scalar multiplication AVX segmentation fault on Mac OSX

Hi I am trying to write a code for Vector-Scalar multiplication using AVX on Sandy Bridge processor i7-3720QM (~2012). The code is a C code compiled with GNU gcc on Mac OSX 10.8. gcc -mavx -Wa,-q -o bb5 code1.c -lm I am getting Segmentation fault:…

c gcc vector avx

asked Jan 30 '17 at 08:45

Guddu

0 answers

AVX support for remainder in G++ 5.4.0

I am writing a program using AVX with G++ 5.4.0, Ubuntu 16.04. Intel Intrinsics Guide( https://software.intel.com/sites/landingpage/IntrinsicsGuide/) said I can use _mm256_irem_epi32 in immintrin.h to compute element-wise remainder of two…

c++ g++ avx

asked Dec 19 '16 at 18:12

Harper

1 answer

Matrix multiplication code running slower with AVX2

I am learning to program with AVX. So, I wrote a simple program to multiply matrices of size 4. While with no compiler optimizations, the AVX version is slightly faster than the non-AVX version, with O3 optimization, the non-AVX version becomes…

c++ c simd avx

asked Dec 10 '16 at 11:19

pythonic

1 answer

AVX2 SIMD addition not working

I am trying to add this two vectors using AVX2 SIMD instruction. The code compiles with no error & warning, but crashes when run. Why? It should print the result of SIMD addition with AVX2 no matter how large the array is which is initialized in…

c++ sse simd avx avx2

asked Dec 10 '16 at 11:16

K.Malu

0 answers

AVX2 Matrix 4x4 multiplication not working

This is a 4x4 matrix multiplication program using AVX2. But the program is not displaying the output. Please see where is the problem and do i have to do anything for memory alignment or not? Please suggest. #include #include…

c++ sse matrix-multiplication simd avx

asked Nov 20 '16 at 11:00

K.Malu

1 answer

data alignment in structure and avx optimization

I'm trying to figure out what is the best (maybe avx?) optimization for this code typedef struct { float x; float y; } vector; vector add(vector u, vector v){ return (vector){u.x+v.x, u.y+v.y}; } running gcc -S code.c gives a quite long…

c floating-point avx

asked Nov 16 '16 at 09:21

Fabio
