Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (where AVX only provided 256-bit floating point).

AVX2 adds support for 256-bit integer SIMD. Most existing 128-bit SSE integer instructions are extended to 256-bit. AVX2 uses the same VEX encoding scheme as AVX instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

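As with AVX, common problems are missing VZEROUPPER instructions (which can cause SSE/AVX transition penalties) and non-obvious data movement in shuffles, because most 256-bit shuffles operate within two separate 128-bit lanes.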
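The in-lane shuffle behaviour is easiest to see on a small case. The following is a minimal sketch, assuming GCC or Clang (the `target("avx2")` attribute is compiler-specific) and an AVX2-capable CPU at run time. The function names `interleave_in_lane` and `interleave_low_halves` are made up for illustration.

```c
#include <immintrin.h>
#include <stdint.h>

/* vpunpckldq works on each 128-bit lane separately.
   With a = {0..7} and b = {10..17} the result is
   {0,10,1,11, 4,14,5,15}, not {0,10,1,11, 2,12,3,13}. */
__attribute__((target("avx2")))
void interleave_in_lane(const int32_t a[8], const int32_t b[8], int32_t out[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_unpacklo_epi32(va, vb));
}

/* One possible fix: first reorder the 64-bit quarters of each input to
   [q0, q2, q1, q3] with the lane-crossing vpermq (imm8 = 0xD8), so that the
   in-lane unpack then sees elements 0,1 in the low lane and 2,3 in the
   high lane. The result is {a0,b0,a1,b1, a2,b2,a3,b3}. */
__attribute__((target("avx2")))
void interleave_low_halves(const int32_t a[8], const int32_t b[8], int32_t out[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    va = _mm256_permute4x64_epi64(va, 0xD8);
    vb = _mm256_permute4x64_epi64(vb, 0xD8);
    _mm256_storeu_si256((__m256i *)out, _mm256_unpacklo_epi32(va, vb));
}
```

The extra lane-crossing permutes are not free, so when possible it is usually better to arrange the data so that the in-lane result is what the algorithm needs.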

AVX2 also adds the following new functionality:

  • Scalar -> Vector register broadcast
  • Gather loads for loading a vector from different memory locations.
  • Masked memory loads/stores
  • New permute instructions
  • Element-wise bit-shifting that allows each element of a vector to be shifted by a different amount.
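Three of the features listed above can be sketched with intrinsics as follows. This is an illustrative sketch only, assuming GCC or Clang (for the `target("avx2")` attribute) and a CPU that supports AVX2. The helper names `shift_each_left`, `gather8` and `load_first_n` are hypothetical.

```c
#include <immintrin.h>
#include <stdint.h>

/* Per-element variable shift (vpsllvd): out[i] = in[i] << counts[i].
   Counts of 32 or more produce 0 for that element. */
__attribute__((target("avx2")))
void shift_each_left(const uint32_t in[8], const uint32_t counts[8], uint32_t out[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i c = _mm256_loadu_si256((const __m256i *)counts);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(v, c));
}

/* Gather load (vpgatherdd): out[i] = table[idx[i]].
   The scale of 4 is the size in bytes of one 32-bit element. */
__attribute__((target("avx2")))
void gather8(const int32_t *table, const int32_t idx[8], int32_t out[8])
{
    __m256i vi = _mm256_loadu_si256((const __m256i *)idx);
    __m256i g  = _mm256_i32gather_epi32((const int *)table, vi, 4);
    _mm256_storeu_si256((__m256i *)out, g);
}

/* Masked load (vpmaskmovd): loads the first n (0..8) elements of src and
   zeroes the rest. Masked-off elements do not fault, so src may be shorter
   than 8 elements. */
__attribute__((target("avx2")))
void load_first_n(const int32_t *src, int n, int32_t out[8])
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* All-ones in every element whose index is below n. */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    __m256i v = _mm256_maskload_epi32((const int *)src, mask);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

A caller that cannot assume AVX2 would check for it at run time first, for example with `__builtin_cpu_supports("avx2")` on GCC and Clang.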

The AVX2 instruction set was introduced together with FMA3 (3-operand Fused-Multiply Add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but AVX2 support was not introduced then.)

683 questions
8
votes
1 answer

How to concatenate two vector efficiently using AVX2? (a lane-crossing version of VPALIGNR)

I have implemented an inline function (_mm256_concat_epi16). It concatenates two AVX2 vector containing 16-bit values. It works fine for first 8 numbers. If I want to use it for the rest of the vector I should change the implementation. But It would…
Amiri
  • 2,417
  • 1
  • 15
  • 42
8
votes
2 answers

Fully utilizing pipelines on kaby lake

(Followup code-review question here, with more details of the context of this loop.) Environment: Windows 7 x64 VS 2017 community Targeting x64 code on Intel i7700k (kaby lake) I don't write a lot of assembler code, and when I do, it's either…
David Wohlferd
  • 7,110
  • 2
  • 29
  • 56
8
votes
1 answer

Find the first instance of a character using simd

I am trying to find the first instance of a character, in this case '"' using simd (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding if any of the resulting bytes in the __m256i have been set to 0xFF. The…
Jimbo
  • 2,886
  • 2
  • 29
  • 45
8
votes
2 answers

Efficient way of rotating a byte inside an AVX register

Summary/tl;dr: Is there any way to rotate a byte in an YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in an YMM register, I need to left-rotate 7 bytes in it. Each byte needs to be…
oPolo
  • 516
  • 3
  • 14
8
votes
2 answers

Why do processors with only AVX out-perform AVX2 processors for many SIMD algorithms?

I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand…
eoinmullan
  • 1,157
  • 1
  • 9
  • 32
8
votes
1 answer

Loading 8 chars from memory into an __m256 variable as packed single precision floats

I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task? // unsigned char…
pseudomarvin
  • 1,477
  • 2
  • 17
  • 32
8
votes
2 answers

Optimal SIMD algorithm to rotate or transpose an array

I am working on a data structure where I have an array of 16 uint64. They are laid out like this in memory (each below representing a single int64): A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3 The desired result is to transpose the array into…
Thomas Kejser
  • 1,264
  • 1
  • 10
  • 30
8
votes
3 answers

Does /arch:AVX enable AVX2?

Does /arch:AVX enable AVX2 (with 256-bit integer SIMD instructions and some new FP shuffles) on the Visual Studio 2012 Update 4? Line of thought: Yes, it enables AVX because VS doesn't mention AVX2. But I think VS can do AVX2 because my intrinsic…
Mikhail
  • 7,749
  • 11
  • 62
  • 136
8
votes
1 answer

How to store lower or higher values from AVX/AVX2(YMM) register to memory like the SSE movlps/movhps does?

Is there any existing instructions which could store lower or higher values from a 256 bit AVX/AVX2(YMM) register to memory address, just like the SSE instruction movlps/movhps does? Or is there any other way to implement this? Any help would be…
Sean Yang
  • 171
  • 1
  • 4
7
votes
1 answer

Is using AVX2 can implement a faster processing of LZCNT on a word array?

I need to bit scan reverse with LZCNT an array of words: 16 bits. The throughput of LZCNT is 1 execution per clock on an Intel latest generation processors. The throughput on an AMD Ryzen seems to be 4. I am trying to find an algorithm using the…
Guy B
  • 217
  • 4
  • 15
7
votes
1 answer

gcc auto vectorization control flow in loop

In the code below, why is the second loop able to be auto vectorized but the first cannot? How can I modify the code so it does auto vectorize? gcc says: note: not vectorized: control flow in loop. I am using gcc 8.2, flags are -O3…
user2133814
  • 2,431
  • 1
  • 24
  • 34
7
votes
1 answer

Get an arbitrary float from a simd register at runtime?

I want to access an arbitrary float from a simd register. I know that I can do things like: float get(const __m128i& a, const int idx){ // editor's note: this type-puns the FP bit-pattern to int and converts to float return…
BadProgrammer99
  • 759
  • 1
  • 5
  • 13
7
votes
3 answers

Using a variable to index a simd vector with _mm256_extract_epi32() intrinsic

I am using the AVX intrinsic _mm256_extract_epi32(). I am not entirely sure if I am using it correctly, though, because gcc doesn't like my code, whereas clang compiles it and runs it without issue. I am extracting the lane based on the value of an…
Bram
  • 7,440
  • 3
  • 52
  • 94
7
votes
1 answer

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

Is there an intrinsic or another efficient way for repacking high/low 32-bit components of 64-bit components of AVX register into an SSE register? A solution using AVX2 is ok. So far I'm using the following code, but profiler says it's slow on Ryzen…
Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158
7
votes
1 answer

gdb reverse debugging avx2

So I have a new fancy cpu that supports avx2 instruction set. This is great, but breaks gdb reverse debugging. When compiling with no optimisations code still uses shared libraries, eg calls memset() which then goes and invokes an avx2 optimised…
Hal
  • 1,061
  • 7
  • 20