Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (VEX-encoded) and non-AVX (old SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance problem. Several performance questions under this tag turned out to have a missing VZEROUPPER as the answer.

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
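
To make the two-lane behaviour concrete, here is a minimal scalar sketch in plain C of what the 256-bit form of UNPCKLPS does to its elements. The function name model_vunpcklps256 is hypothetical and exists only for illustration; it is not an intrinsic, and real code would use _mm256_unpacklo_ps instead.

```c
/* Hypothetical scalar model of 256-bit VUNPCKLPS (the operation behind
   _mm256_unpacklo_ps). It is not an intrinsic; it only shows which
   element moves where. Each 128-bit lane (4 floats) is handled
   independently of the other. */
void model_vunpcklps256(const float a[8], const float b[8], float dst[8])
{
    for (int lane = 0; lane < 2; lane++) {
        int base = lane * 4;           /* first element of this 128-bit lane */
        dst[base + 0] = a[base + 0];   /* interleave the low two elements    */
        dst[base + 1] = b[base + 0];   /* of a and b, staying inside the     */
        dst[base + 2] = a[base + 1];   /* current lane                       */
        dst[base + 3] = b[base + 1];
    }
}
```

With a = {0, 1, 2, 3, 4, 5, 6, 7} and b = {10, 11, 12, 13, 14, 15, 16, 17}, the model produces {0, 10, 1, 11, 4, 14, 5, 15}. Treating the register as one long vector would wrongly predict {0, 10, 1, 11, 2, 12, 3, 13}.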

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely the data isn't aligned when it needs to be. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc, and use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work. Includes in-lane vs. lane-crossing for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there's a way to detect that support in an OS-independent way.
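
The 32-byte alignment requirement from the first two FAQ entries above can be sketched in plain C. This is a minimal illustration, assuming a C11 library that provides aligned_alloc (MSVC does not; it offers _aligned_malloc instead). The helper names is_aligned_32 and alloc_floats_32 are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if p sits on a 32-byte boundary, as the
   aligned 256-bit load/store intrinsics (e.g. _mm256_load_ps) expect. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* Hypothetical helper: allocate room for n floats on a 32-byte boundary.
   C11 aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Overflow of n * sizeof(float) is not
   checked in this sketch. Release the result with free(). */
float *alloc_floats_32(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) / 32 * 32;
    if (rounded == 0)
        rounded = 32;
    return aligned_alloc(32, rounded);
}
```

A pointer returned by alloc_floats_32 passes the check, but p + 1 is only 4 bytes past the boundary, so accesses through it would need the unaligned loadu / storeu forms.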

1252 questions

votes

0 answers

SSE/AVX instructions to accelerate the expression u32 = (z << 16) | (y << 8) | x

I have 3 unsigned ints with range [0, 255]. I want to store these 3 numbers to a compact storage and since this operation happens too often I want to know how I can improve it. Initially I tried this: struct Foo { uint8_t x; uint8_t y; …

c sse simd avx

asked Nov 07 '16 at 15:37

Pan. Christopoulos Charitos

votes

0 answers

identifier "intrinsic function" is undefined

when I compiled my intel AVX code using intrinsic functions and intel compiler 2016 in visual studio C++ 2015 that error appears for all intrinsics: for example: identifier"_mm256_broadcast_ss" is undefined. this sample of my code: …

c++ intel avx

asked Oct 18 '16 at 16:47

semsem

votes

1 answer

How to disable AVX instructions in OpenSSL?

I've got a problem with running iOS application created with RoboVM framework. Probably this is caused by my processor which is not supporting AVX instruction. I found the page: https://www.openssl.org/docs/manmaster/crypto/OPENSSL_ia32cap.html with…

ios openssl avx robovm

asked Oct 17 '16 at 17:00

Blady214

votes

2 answers

Does the bitwise operation (&, ^, | etc) provided as operator overloads in the std::bitset use AVX or SSE4 instructions?

Since this is implementation dependent, is the only way to find that out is through the disassembly?

c++ stl simd avx avx2

asked Oct 05 '16 at 22:19

VINAY PALAKKODE

votes

0 answers

Demultiplex an AVX register into four registers each containing identical values

I have an array double x[4] of four doubles stored contiguously in memory. What would be the fastest (in terms of efficient) way using the AVX instruction set to prepare four registers, say, ymm0,ymm1,ymm2,ymm3 such that : ymm0 = { x[0], x[0], x[0],…

c simd intrinsics avx

asked Sep 02 '16 at 16:04

Tomas

votes

0 answers

SIMD (AVX2) mask store and pack

I am trying to perform the following operation in AVX2 code (dest, data, and mask are int32 pointers): int j=0; for(i=0; i

c sse simd avx avx2

asked Aug 16 '16 at 23:07

nineties

votes

0 answers

SSE Efficient signed short convolution

I am trying to implement fixed point 7X7 convolution on large signed short images (1000X1000). The (float) kernel is scaled up (by 1<<14) to get valid results, and the final results are scaled down back. I am implementing it using SSE. Working on…

optimization sse convolution avx intel-ipp

asked Aug 04 '16 at 09:29

user1014366

votes

1 answer

AVX, Horizontal Sum of Single Precision Complex Numbers?

I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient. How can…

c++ avx avx2

asked Jul 12 '16 at 14:29

user1777820

votes

1 answer

Automatically use AVX/SSE if available at runtime?

Dupe of Have different optimizations (plain, SSE, AVX) in the same executable with C/C++ The "Auto-duplicate" thing picked the wrong suggested duplicate, and I don't seem to have the interface to fix it. Is there any way to build a application that…

c++ visual-c++ avx

asked Jul 11 '16 at 22:54

Fake Name

votes

1 answer

What am I doing with SIMD and pthreads that is slowing my program down?

!!! HOMEWORK - ASSIGNMENT !!! Please do not post code as I would like to complete myself but rather if possible point me in the right direction with general information or by pointing out mistakes in thought or other possible useful and relevant…

c multithreading pthreads simd avx

asked May 31 '16 at 09:43

joshuatvernon

votes

1 answer

Convert int to double in AVX x86

I have an external function: extern "C" void calculateAreaUnderCurve_(double startPoint, double endPoint, int numberOfTrapezes, double* coefficients, double* result); I'd like to convert numberOfTrapezes to a double in my .asm file. I tried with:…

assembly x86 x86-64 masm avx

asked May 28 '16 at 00:46

kstanisz

votes

0 answers

AVX; byte multiplication; sum;

I'm optimising the following code with AVX and want to know your opinion about the best approach. There are two blocks of data uint8 x[3][3]; uint8 y[3][3]; result is uint8 value which is sum of multiplication of corresponding elements like res =…

assembly 64-bit sse simd avx

asked May 10 '16 at 14:31

user3124812

votes

1 answer

AVX2 __m256i const* mem_addr in load instructions vs AVX

I can not load or store with AVX2 intrinsics instructions as I've done in AVX before. No error, just warnings, and it does not perform the load/store instruction at run-time. Other AVX2 instructions work properly but I can not load from memory. As…

c x86 simd avx avx2

asked Mar 03 '16 at 17:34

ADMS

votes

1 answer

MSVC 2015 AVX2 debugging problems. Not all SIMD lanes are populated correctly

I'm having trouble debugging my AVX2 code in Visual Studio 2015, update 1 (targeting Win10). When using the debugger and inspecting an AVX2 register, the contents differs when using a breakpoint and stepping over the _mm256_insertf128_ps-intrinsic…

visual-studio-2015 avx avx2

asked Mar 01 '16 at 09:57

repstosq

votes

1 answer

For some reason serial code runs faster than SIMD code

For some reason running the simple serial code for(i=0;i<1152*1152;i++){ MatrixA3[i] = MatrixA1[i] + z*MatrixA2[i];} runs faster than or same speed with the vectorized equivalent; for (int i = 0; i < 1152*1152; i+=4){ load_data1 =…

c++ avx avx2

asked Jun 11 '15 at 19:04

Tracy Maxen
