Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (where AVX only provided 256-bit floating point).

AVX2 adds support for 256-bit integer SIMD. Most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER and non-obvious data movement in shuffles: most 256-bit shuffles operate on the two 128-bit lanes independently rather than on the register as a whole.
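The lane behaviour is easiest to see with a byte shift. The following is a minimal sketch, not a reference implementation: it assumes GCC or Clang (for `__attribute__((target("avx2")))` and `__builtin_cpu_supports`), and every function name in it is hypothetical. It contrasts `_mm256_slli_si256`, which shifts each 128-bit lane separately, with one common way to get a true 256-bit byte shift by combining `_mm256_permute2x128_si256` and `_mm256_alignr_epi8`. A scalar fallback is included so the code also runs on CPUs without AVX2.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* All function names are hypothetical. `in` and `out` are 32 bytes each
   and must not overlap. */

/* vpslldq: each 128-bit lane is shifted left by 4 bytes on its own,
   so no byte moves from the low lane into the high lane. */
__attribute__((target("avx2")))
static void shift_in_lane_avx2(const uint8_t *in, uint8_t *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, _mm256_slli_si256(v, 4));
}

/* Full 256-bit left shift by 4 bytes: the bytes leaving the low lane
   are carried into the high lane. */
__attribute__((target("avx2")))
static void shift_full_avx2(const uint8_t *in, uint8_t *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* t: low lane = zero, high lane = low lane of v */
    __m256i t = _mm256_permute2x128_si256(v, v, 0x08);
    /* per lane: take bytes 12..27 of the pair (v lane : t lane) */
    __m256i r = _mm256_alignr_epi8(v, t, 16 - 4);
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Scalar models of the same two operations, used as a fallback. */
static void shift_in_lane_scalar(const uint8_t *in, uint8_t *out)
{
    memset(out, 0, 32);
    memcpy(out + 4, in, 12);        /* low lane  */
    memcpy(out + 20, in + 16, 12);  /* high lane */
}

static void shift_full_scalar(const uint8_t *in, uint8_t *out)
{
    memset(out, 0, 32);
    memcpy(out + 4, in, 28);
}

/* Runtime dispatch so the sketch also runs without AVX2. */
void shift_in_lane(const uint8_t *in, uint8_t *out)
{
    if (__builtin_cpu_supports("avx2"))
        shift_in_lane_avx2(in, out);
    else
        shift_in_lane_scalar(in, out);
}

void shift_full(const uint8_t *in, uint8_t *out)
{
    if (__builtin_cpu_supports("avx2"))
        shift_full_avx2(in, out);
    else
        shift_full_scalar(in, out);
}
```

With input bytes 1..32, `shift_in_lane` leaves bytes 16..19 of the result zero, while `shift_full` places the input bytes 13..16 there.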

AVX2 also adds the following new functionality:

  • Scalar -> Vector register broadcast
  • Gather loads for loading a vector from different memory locations.
  • Masked memory loads/stores
  • New permute instructions, including lane-crossing ones (e.g. vpermd, vpermq)
  • Element-wise variable shifts that allow each element of a vector to be shifted by a different amount.
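Some of the features listed above can be tried out with a few intrinsics. This is a hedged sketch under stated assumptions: it relies on GCC or Clang (for the `target` attribute and `__builtin_cpu_supports`), and the function names `gather_shift` and `load_first_n` are made up for illustration. It shows a gather (`_mm256_i32gather_epi32`), a per-element variable shift (`_mm256_sllv_epi32`) and a masked load (`_mm256_maskload_epi32`), each with a scalar fallback that models the same result.

```c
#include <immintrin.h>
#include <stdint.h>

/* Function names are hypothetical. */

/* out[i] = table[idx[i]] << cnt[i], for 8 elements at once. */
__attribute__((target("avx2")))
static void gather_shift_avx2(const int32_t *table, const int32_t *idx,
                              const int32_t *cnt, int32_t *out)
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vcnt = _mm256_loadu_si256((const __m256i *)cnt);
    /* vpgatherdd: scale 4 because the indices count 4-byte elements */
    __m256i g = _mm256_i32gather_epi32((const int *)table, vidx, 4);
    /* vpsllvd: each element uses its own count; counts > 31 give 0 */
    __m256i r = _mm256_sllv_epi32(g, vcnt);
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Load the first n (0..8) ints of src; the remaining elements are zero. */
__attribute__((target("avx2")))
static void load_first_n_avx2(const int32_t *src, int n, int32_t *out)
{
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* element i of the mask is all-ones when i < n */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    /* vpmaskmovd: masked-off elements are not read and come back as 0 */
    __m256i v = _mm256_maskload_epi32((const int *)src, mask);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Scalar fallbacks that model the same results. */
static void gather_shift_scalar(const int32_t *table, const int32_t *idx,
                                const int32_t *cnt, int32_t *out)
{
    for (int i = 0; i < 8; i++) {
        uint32_t c = (uint32_t)cnt[i];
        out[i] = c > 31 ? 0 : (int32_t)((uint32_t)table[idx[i]] << c);
    }
}

static void load_first_n_scalar(const int32_t *src, int n, int32_t *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = i < n ? src[i] : 0;
}

/* Runtime dispatch so the sketch also runs without AVX2. */
void gather_shift(const int32_t *table, const int32_t *idx,
                  const int32_t *cnt, int32_t *out)
{
    if (__builtin_cpu_supports("avx2"))
        gather_shift_avx2(table, idx, cnt, out);
    else
        gather_shift_scalar(table, idx, cnt, out);
}

void load_first_n(const int32_t *src, int n, int32_t *out)
{
    if (__builtin_cpu_supports("avx2"))
        load_first_n_avx2(src, n, out);
    else
        load_first_n_scalar(src, n, out);
}
```

A masked load like the one in `load_first_n` is one way to handle the tail of an array whose length is not a multiple of 8 without reading past its end.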

The AVX2 instruction set was introduced together with FMA3 (3-operand fused multiply-add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but they did not gain AVX2 support at that point.)

683 questions
15
votes
2 answers

Aligned and unaligned memory access with AVX/AVX2 intrinsics

According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g. vaddps ymm0,ymm0,YMMWORD PTR [rax] the load address doesn't have to…
Ralf
  • 1,203
  • 1
  • 11
  • 20
15
votes
3 answers

Load address calculation when using AVX2 gather instructions

Looking at the AVX2 intrinsics documentation there are gathered load instructions such as VPGATHERDD: __m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale); What isn't clear to me from the documentation is whether the…
Paul R
  • 208,748
  • 37
  • 389
  • 560
15
votes
2 answers

Scatter intrinsics in AVX

I can't find them in the Intel Intrinsic Guide v2.7. Do you know if AVX or AVX2 instruction sets support them?
elmattic
  • 12,046
  • 5
  • 43
  • 79
14
votes
5 answers

Fastest Implementation of Exponential Function Using AVX

I'm looking for an efficient (Fast) approximation of the exponential function operating on AVX elements (Single Precision Floating Point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML. Relative Accuracy should be something like ~1e-6, or…
Royi
  • 4,640
  • 6
  • 46
  • 64
14
votes
2 answers

What's the fastest stride-3 gather instruction sequence?

The question: What is the most efficient sequence to generate a stride-3 gather of 32-bit elements from memory? If the memory is arranged as: MEM = R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3 ... We want to obtain three YMM registers where: YMM0 = R0 R1 R2…
zr.
  • 7,528
  • 11
  • 50
  • 84
13
votes
1 answer

Disabling AVX2 in CPU for testing purposes

I've got an application that requires AVX2 to work correctly. A check was implemented to verify at application start whether the CPU supports AVX2 instructions. I would like to check if it works correctly, but I only have a CPU that has AVX2. Is there a way to…
Biba
  • 631
  • 9
  • 28
13
votes
1 answer

Why both? vperm2f128 (avx) vs vperm2i128 (avx2)

AVX introduced the instruction vperm2f128 (exposed via _mm256_permute2f128_si256), while AVX2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256). They both seem to do exactly the same thing, and their respective latencies and throughputs…
mSSM
  • 598
  • 5
  • 12
13
votes
1 answer

Why does storing to and loading from an AVX2 256-bit vector have different results in debug and release mode?

When I try to store and load 256 bits to and from an AVX2 256-bit vector, I'm not receiving the expected output in release mode. use std::arch::x86_64::*; fn main() { let key = [1u64, 2, 3, 4]; let avxreg = unsafe {…
Nick Babcock
  • 6,111
  • 3
  • 27
  • 43
13
votes
3 answers

How to clear the upper 128 bits of __m256 value?

How can I clear the upper 128 bits of m2: __m256i m2 = _mm256_set1_epi32(2); __m128i m1 = _mm_set1_epi32(1); m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2)); m2 = _mm256_castsi128_si256(m1); doesn't work -- Intel’s documentation for…
seda
  • 141
  • 5
13
votes
1 answer

8 bit shift operation in AVX2 with shifting in zeros

Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift an __m256i register by x bytes? The _mm256_slli_si256 seems just to execute two _mm_slli_si128 on a[127:0] and a[255:128]. The left shift should work on a __m256i like…
martin s
  • 1,121
  • 1
  • 12
  • 29
13
votes
2 answers

What's the difference between vextracti128 and vextractf128?

vextracti128 and vextractf128 have the same functionality, parameters, and return values. In addition, one belongs to the AVX instruction set while the other belongs to AVX2. What is the difference?
user2813757
  • 141
  • 1
  • 3
12
votes
3 answers

How to enable AVX / AVX2 in VirtualBox 6.1.16 with Ubuntu 20.04 64bit?

TL;DR: Tensorflow 1.15 crashes on my virtual machine when imported by Python (error message is Illegal instruction (core dumped)), very probably thanks to AVX and AVX2 being disabled on it. My host machine (Windows 10 64bit) has AVX and AVX2…
SomethingSomething
  • 11,491
  • 17
  • 68
  • 126
12
votes
2 answers

What's the fastest way to perform an arbitrary 128/256/512 bit permutation using SIMD instructions?

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible. For this I was looking into SIMD instructions. Does…
J Bausch
  • 123
  • 6
12
votes
3 answers

Do all CPUs which support AVX2 also support SSE4.2 and AVX?

I am planning to implement runtime detection of SIMD extensions. If I find out that the processor has AVX2 support, is it also guaranteed to have SSE4.2 and AVX support?
rubund
  • 7,603
  • 3
  • 15
  • 24
12
votes
1 answer

Is there an inverse instruction to the movemask instruction in Intel AVX2?

The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8 or all 32 bits depending on the input vector element type) is the most significant bit of the corresponding vector element. I would like to do the…
orm
  • 2,835
  • 2
  • 22
  • 35