Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (VEX-encoded) and non-AVX (old SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance problem. A missing VZEROUPPER has turned out to be the answer to several performance questions.

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
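To make the in-lane behaviour concrete, here is a minimal scalar sketch in plain C of what the 256-bit form of UNPCKLPS (VUNPCKLPS, the `_mm256_unpacklo_ps` intrinsic) does. The helper name `unpacklo_ps_256_model` is hypothetical and exists only for illustration; it models element movement and is not how you would write real AVX code.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the 256-bit VUNPCKLPS (_mm256_unpacklo_ps).
   Hypothetical helper for illustration only.
   a, b and dst each stand for one ymm register holding 8 floats.
   dst must not alias a or b. */
void unpacklo_ps_256_model(const float a[8], const float b[8], float dst[8])
{
    assert(a != NULL && b != NULL && dst != NULL);

    /* The instruction works on each 128-bit lane (4 floats) independently. */
    for (size_t lane = 0; lane < 2; ++lane) {
        size_t base = 4 * lane;
        /* Interleave the low two elements of this lane from a and b. */
        dst[base + 0] = a[base + 0];
        dst[base + 1] = b[base + 0];
        dst[base + 2] = a[base + 1];
        dst[base + 3] = b[base + 1];
    }
}
```

With a = {0, 1, 2, 3, 4, 5, 6, 7} and b = {10, 11, 12, 13, 14, 15, 16, 17}, the model produces {0, 10, 1, 11, 4, 14, 5, 15}. Treating the register as one long vector would instead suggest {0, 10, 1, 11, 2, 12, 3, 13}, which is not what the hardware does.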

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.


Interesting Q&As / FAQs:

1252 questions
0
votes
0 answers

SSE/AVX instructions to accelerate the expression u32 = (z << 16) | (y << 8) | x

I have 3 unsigned ints with range [0, 255]. I want to store these 3 numbers to a compact storage and since this operation happens too often I want to know how I can improve it. Initially I tried this: struct Foo { uint8_t x; uint8_t y; …
0
votes
0 answers

identifier "intrinsic function" is undefined

When I compile my Intel AVX code using intrinsic functions with the Intel compiler 2016 in Visual Studio C++ 2015, this error appears for all intrinsics, for example: identifier "_mm256_broadcast_ss" is undefined. This is a sample of my code: …
semsem
  • 1
  • 1
0
votes
1 answer

How to disable AVX instructions in OpenSSL?

I've got a problem running an iOS application created with the RoboVM framework. This is probably caused by my processor, which does not support AVX instructions. I found the page: https://www.openssl.org/docs/manmaster/crypto/OPENSSL_ia32cap.html with…
Blady214
  • 729
  • 6
  • 19
0
votes
2 answers

Do the bitwise operations (&, ^, | etc.) provided as operator overloads in std::bitset use AVX or SSE4 instructions?

Since this is implementation dependent, is the only way to find out through the disassembly?
0
votes
0 answers

Demultiplex an AVX register into four registers each containing identical values

I have an array double x[4] of four doubles stored contiguously in memory. What would be the fastest (most efficient) way using the AVX instruction set to prepare four registers, say, ymm0, ymm1, ymm2, ymm3 such that: ymm0 = { x[0], x[0], x[0],…
Tomas
  • 61
  • 1
  • 4
0
votes
0 answers

SIMD (AVX2) mask store and pack

I am trying to perform the following operation in AVX2 code (dest, data, and mask are int32 pointers): int j=0; for(i=0; i
nineties
  • 423
  • 1
  • 7
  • 17
0
votes
0 answers

SSE Efficient signed short convolution

I am trying to implement fixed point 7X7 convolution on large signed short images (1000X1000). The (float) kernel is scaled up (by 1<<14) to get valid results, and the final results are scaled down back. I am implementing it using SSE. Working on…
user1014366
  • 95
  • 1
  • 1
  • 4
0
votes
1 answer

AVX, Horizontal Sum of Single Precision Complex Numbers?

I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient. How can…
user1777820
  • 728
  • 9
  • 29
0
votes
1 answer

Automatically use AVX/SSE if available at runtime?

Dupe of Have different optimizations (plain, SSE, AVX) in the same executable with C/C++. The "Auto-duplicate" thing picked the wrong suggested duplicate, and I don't seem to have the interface to fix it. Is there any way to build an application that…
Fake Name
  • 5,556
  • 5
  • 44
  • 66
0
votes
1 answer

What am I doing with SIMD and pthreads that is slowing my program down?

!!! HOMEWORK - ASSIGNMENT !!! Please do not post code as I would like to complete myself but rather if possible point me in the right direction with general information or by pointing out mistakes in thought or other possible useful and relevant…
joshuatvernon
  • 1,530
  • 2
  • 23
  • 45
0
votes
1 answer

Convert int to double in AVX x86

I have an external function: extern "C" void calculateAreaUnderCurve_(double startPoint, double endPoint, int numberOfTrapezes, double* coefficients, double* result); I'd like to convert numberOfTrapezes to a double in my .asm file. I tried with:…
kstanisz
  • 207
  • 3
  • 15
0
votes
0 answers

AVX; byte multiplication; sum;

I'm optimising the following code with AVX and want to know your opinion about the best approach. There are two blocks of data uint8 x[3][3]; uint8 y[3][3]; result is uint8 value which is sum of multiplication of corresponding elements like res =…
user3124812
  • 1,861
  • 3
  • 18
  • 39
0
votes
1 answer

AVX2 __m256i const* mem_addr in load instructions vs AVX

I can not load or store with AVX2 intrinsics instructions as I've done in AVX before. No error, just warnings, and it does not perform the load/store instruction at run-time. Other AVX2 instructions work properly but I can not load from memory. As…
ADMS
  • 117
  • 3
  • 18
0
votes
1 answer

MSVC 2015 AVX2 debugging problems. Not all SIMD lanes are populated correctly

I'm having trouble debugging my AVX2 code in Visual Studio 2015, update 1 (targeting Win10). When using the debugger and inspecting an AVX2 register, the contents differ when using a breakpoint and stepping over the _mm256_insertf128_ps-intrinsic…
repstosq
  • 3
  • 1
0
votes
1 answer

For some reason serial code runs faster than SIMD code

For some reason, running the simple serial code for(i=0;i<1152*1152;i++){ MatrixA3[i] = MatrixA1[i] + z*MatrixA2[i];} is faster than, or the same speed as, the vectorized equivalent: for (int i = 0; i < 1152*1152; i+=4){ load_data1 =…