Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of the integer instructions (AVX itself only provided 256-bit floating-point instructions).

AVX2 adds support for 256-bit integer SIMD: most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER (which causes SSE/AVX transition penalties) and non-obvious data movement in shuffles, due to the design's two 128-bit lanes.
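As a concrete example of the lane design (a minimal sketch of our own, not tied to any particular question below): _mm256_unpacklo_epi32 interleaves within each 128-bit lane, not across the whole register.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m256i a  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i b  = _mm256_setr_epi32(10, 11, 12, 13, 14, 15, 16, 17);
        __m256i lo = _mm256_unpacklo_epi32(a, b);   // interleaves per lane
        int out[8];
        _mm256_storeu_si256((__m256i *)out, lo);
        for (int i = 0; i < 8; ++i) printf("%d ", out[i]);
        // prints "0 10 1 11 4 14 5 15", not the full-width "0 10 1 11 2 12 3 13"
        return 0;
    }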

AVX2 also adds the following new functionality, each of which appears in the sketch after this list:

  • Scalar-to-vector register broadcasts
  • Gather loads, for loading a vector from non-contiguous memory locations
  • Masked memory loads/stores
  • New permute instructions, including lane-crossing permutes of 32-bit elements
  • Per-element variable shifts, where each element of a vector can be shifted by a different amount
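The sketch below touches all five. The function, its parameters, and the index data are illustrative only; in particular, every indices[i] must be a valid offset into data.

    #include <immintrin.h>

    // illustrative tour of the new AVX2 functionality listed above;
    // out must have room for 16 ints, indices[0..7] must be in range
    void avx2_feature_tour(const int *data, const int *indices, int *out) {
        __m256i bcast = _mm256_set1_epi32(data[0]);           // vpbroadcastd
        __m256i idx   = _mm256_loadu_si256((const __m256i *)indices);
        __m256i g     = _mm256_i32gather_epi32(data, idx, 4); // vpgatherdd
        __m256i mask  = _mm256_cmpgt_epi32(g, bcast);
        _mm256_maskstore_epi32(out, mask, g);                 // vpmaskmovd
        __m256i rev   = _mm256_permutevar8x32_epi32(          // vpermd
            g, _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0));
        __m256i sh    = _mm256_sllv_epi32(                    // vpsllvd
            rev, _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
        _mm256_storeu_si256((__m256i *)(out + 8), sh);
    }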

The AVX2 instruction set was introduced together with FMA3 (3-operand fused multiply-add) in 2013 with Intel's Haswell processor line. (AMD CPUs have supported FMA3 since Piledriver, but AVX2 did not arrive on AMD until the Excavator core.)

683 questions
12 votes, 1 answer

Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge?

In the Intel intrinsics webapp, several operations seem to have worsened from Sandy Bridge to Haswell. For example, many insert operations like _mm256_insertf128_si256 show a cost table like the following: Performance Architecture Latency …
orm
12 votes, 3 answers

Emulating shifts on 32 bytes with AVX

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discover that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two halves of the AVX registers separately…
user1196549
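A well-known emulation for a whole-register byte shift builds the bytes that cross the lane boundary with vperm2i128 and stitches them in with vpalignr. A sketch for the left-shift case with a compile-time count N in (0, 16); the macro name is ours:

    #include <immintrin.h>

    // shift the full 256-bit value left by N bytes, for 0 < N < 16;
    // the vperm2i128 yields { low128 = 0, high128 = a_low }, so vpalignr
    // can pull the bytes that cross the lane boundary into the upper lane
    #define MM256_SLLI_SI256(a, N)                                  \
        _mm256_alignr_epi8((a),                                     \
            _mm256_permute2x128_si256((a), (a), 0x08), 16 - (N))

The right-shift case is the mirror image, selecting a's high lane into the low half of the vperm2i128 result instead.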
11 votes, 0 answers

What's the difference between the XOR instructions "VPXORD", "VXORPS" and "VXORPD" in Intel's AVX2

I see that in the AVX2 instruction set, Intel distinguishes the XOR operations on integer, double, and float data with different instructions. For integer there's "VPXORD", for double "VXORPD", and for float "VXORPS". However, per my understanding, they should all…
Harper
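The three flavors as intrinsics, for reference (strictly speaking, VPXORD is the AVX-512 EVEX form; AVX2 itself has VPXOR). All three produce identical bit patterns and differ only in execution domain, integer versus floating point, which can add a cycle of bypass latency on some microarchitectures when a result crosses domains:

    #include <immintrin.h>

    __m256i xor_int(__m256i a, __m256i b) { return _mm256_xor_si256(a, b); } // vpxor
    __m256  xor_ps (__m256 a,  __m256 b)  { return _mm256_xor_ps(a, b);    } // vxorps
    __m256d xor_pd (__m256d a, __m256d b) { return _mm256_xor_pd(a, b);    } // vxorpd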
11 votes, 2 answers

What do you do without fast gather and scatter in AVX2 instructions?

I'm writing a program to detect prime numbers. One part is bit sieving out possible candidates. I've written a fairly fast program, but I thought I'd see if anyone has some better ideas. My program could use some fast gather and scatter…
ChipK
11 votes, 1 answer

Efficient way to set first N or last N bits of __m256i to 1, the rest to 0

How do I efficiently set the first N bits or the last N bits of an __m256i to 1 with AVX2, setting the rest to 0? These are 2 separate operations, for the tail and head of a bit range, when the range may start and end in the middle of an __m256i value. The part of the…
Serge Rogatch
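One possible AVX2 approach for the "first N bits" half (the helper name is ours): compute a per-element shift count and exploit the fact that vpsllvd yields 0 for counts of 32 or more.

    #include <immintrin.h>

    // mask with the first n bits set, for n in [0, 256]
    static __m256i first_n_bits(unsigned n) {
        const __m256i base = _mm256_setr_epi32(0, 32, 64, 96, 128, 160, 192, 224);
        __m256i c = _mm256_sub_epi32(_mm256_set1_epi32((int)n), base);
        c = _mm256_max_epi32(c, _mm256_setzero_si256()); // clamp negatives to 0
        __m256i ones = _mm256_set1_epi32(-1);
        // vpsllvd returns 0 for counts >= 32, so fully covered elements stay ~0
        __m256i high = _mm256_sllv_epi32(ones, c);       // zeros in the low c bits
        return _mm256_andnot_si256(high, ones);          // ~high: low c bits set
    }

The "last N bits" mask is the mirror image, using _mm256_srlv_epi32 and counting elements from the top.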
11 votes, 1 answer

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector where, for every element in a, a bit is set if it has the same value as another element. Is there a way to do something similar in AVX2? I'm not interested in the exact bits, I just…
Christoph Diegelmann
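One way to approximate it with AVX2 (the helper name is ours): compare the vector against all seven of its rotations through vpermd and OR the results. Note the result is symmetric, flagging both members of a duplicate pair, unlike vpconflictd, which only reports conflicts with earlier elements.

    #include <immintrin.h>

    // all-ones in every 32-bit lane that collides with some other lane
    static __m256i any_conflict_epi32(__m256i v) {
        const __m256i rot1 = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);
        __m256i idx  = rot1;
        __m256i hits = _mm256_setzero_si256();
        for (int i = 0; i < 7; ++i) {
            __m256i r = _mm256_permutevar8x32_epi32(v, idx); // rotate by i+1
            hits = _mm256_or_si256(hits, _mm256_cmpeq_epi32(v, r));
            idx  = _mm256_permutevar8x32_epi32(idx, rot1);   // next rotation
        }
        return hits;
    }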
11 votes, 3 answers

Packing and de-interleaving two __m256 registers

I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers. ...a0.........b0...... ...a1.........b1...... // ... ...a7.........b7...... // end first __m256 A naive way to do…
ZachB
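One direct AVX2 option, though not necessarily the fastest on Haswell, where gathers are slow, is a strided gather per column; the helper name, base pointer, and stride parameter here are illustrative:

    #include <immintrin.h>

    // load 8 consecutive values of one column from a row-major float matrix
    static __m256 load_column(const float *col_start, int floats_per_row) {
        __m256i idx = _mm256_mullo_epi32(
            _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
            _mm256_set1_epi32(floats_per_row));
        return _mm256_i32gather_ps(col_start, idx, 4); // scale = sizeof(float)
    }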
11 votes, 1 answer

Where is VPERMB in AVX2?

AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to totally arbitrarily broadcast/shuffle/permute from one 256-bit long vector of…
BeeOnRope
11 votes, 1 answer

Is this incorrect code generation with arrays of __m256 values a clang bug?

I'm encountering what appears to be a bug causing incorrect code generation with clang 3.4, 3.5, and 3.6 trunk. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to this self-contained…
Jason R
11 votes, 1 answer

Fastest way to unpack 32 bits to a 32 byte SIMD vector

Having 32 bits stored in a uint32_t in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte. Edit: to clarify, I mean bit 0 goes to byte 0, bit…
alecco
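A common pattern for this (the helper name is ours): broadcast the 32-bit word, route each source byte to the eight destination bytes that test its bits with vpshufb, then isolate one distinct bit per byte.

    #include <immintrin.h>
    #include <stdint.h>

    // expand bit i of x into byte i of the result (0x00 or 0xFF)
    static __m256i bits_to_bytes(uint32_t x) {
        __m256i v = _mm256_set1_epi32((int)x);
        // vpshufb works per 128-bit lane; the broadcast puts all four source
        // bytes in both lanes, so in-lane indices 0..3 are all we need
        const __m256i shuf = _mm256_setr_epi8(
            0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
            2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3);
        v = _mm256_shuffle_epi8(v, shuf);
        const __m256i bit = _mm256_set1_epi64x((long long)0x8040201008040201ULL);
        v = _mm256_and_si256(v, bit);      // keep one distinct bit per byte
        return _mm256_cmpeq_epi8(v, bit);  // 0xFF where that bit was set
    }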
11 votes, 1 answer

How can I add together two SSE registers

I have two SSE registers (one register is 128 bits) and I want to add them together. I know how I can add corresponding words in them, for example I can do it with _mm_add_epi16 if I use 16-bit words in the registers, but what I want is something like…
Martinsos
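If the goal is to treat each XMM register as one 128-bit unsigned integer, here is a sketch of the carry propagation (the helper name is ours; _mm_cmpgt_epi64 requires SSE4.2, and on plain SSE2 the carry is easier to handle in GPRs):

    #include <immintrin.h>

    // add two XMM registers as single 128-bit unsigned integers
    static __m128i add_u128(__m128i a, __m128i b) {
        __m128i sum = _mm_add_epi64(a, b);
        const __m128i top = _mm_set1_epi64x((long long)0x8000000000000000ULL);
        // unsigned "sum < b" per 64-bit lane via the signed-compare bias trick;
        // a lane overflowed exactly when its sum is below b
        __m128i carry = _mm_cmpgt_epi64(_mm_xor_si128(b, top),
                                        _mm_xor_si128(sum, top));
        // move the low lane's carry (as the value 1) into the high lane
        carry = _mm_slli_si128(_mm_and_si128(carry, _mm_set1_epi64x(1)), 8);
        return _mm_add_epi64(sum, carry);
    }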
10 votes, 1 answer

What is packed and unpacked and extended packed data

I have been going through the Intel intrinsics, and every function works on integers, floats, or doubles that are packed, unpacked, or extended packed. It seems like this question should be answered somewhere on the internet but I can't find the…
Omar Khalid
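As a quick illustration of the naming convention ("packed" forms touch every element, scalar forms only the lowest one; the wrapper names are ours):

    #include <immintrin.h>

    __m128 add_packed(__m128 a, __m128 b) { return _mm_add_ps(a, b); } // all 4 floats
    __m128 add_scalar(__m128 a, __m128 b) { return _mm_add_ss(a, b); } // low float only,
                                                                       // upper 3 copied from a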
10 votes, 1 answer

Best way to load/store from/to general purpose registers to/from xmm/ymm register

What is the best way to load and store general purpose registers to/from SIMD registers? So far I have been using the stack as a temporary. For example, mov [rsp + 0x00], r8 mov [rsp + 0x08], r9 mov [rsp + 0x10], r10 mov [rsp + 0x18], r11 vmovdqa ymm0,…
Yan Zhou
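A stack-free alternative, written as intrinsics (the helper name is ours; compilers typically lower it to vmovq / vpinsrq / vinserti128):

    #include <immintrin.h>
    #include <stdint.h>

    // build a __m256i from four 64-bit GPR values without a memory round trip
    static __m256i gpr_to_ymm(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
        __m128i lo = _mm_cvtsi64_si128((long long)a);   // vmovq
        lo = _mm_insert_epi64(lo, (long long)b, 1);     // vpinsrq
        __m128i hi = _mm_cvtsi64_si128((long long)c);
        hi = _mm_insert_epi64(hi, (long long)d, 1);
        return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
    }

Which route wins depends on the surrounding code: the ALU route costs shuffle-port uops, while the store/reload route pays a store-forwarding stall on the wide reload.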
10 votes, 0 answers

Is there, or will there be, a "global" version of the target_clones attribute?

I've recently played around with the target_clones attribute, available from gcc 6.1 onward. It's quite nifty, but, for now, it requires a somewhat clumsy approach; every function that one wants multi-versioned has to have an attribute declared…
bolind
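For reference, the per-function form looks like this (GCC 6.1+; the function body is illustrative):

    // GCC emits one clone per listed target, plus an ifunc resolver
    // that picks the right one at load time based on the running CPU
    __attribute__((target_clones("avx2", "sse4.1", "default")))
    int dot(const int *a, const int *b, int n) {
        int s = 0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }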
10 votes, 3 answers

AVX2 slower than SSE on Haswell

I have the following code (normal, SSE and AVX): int testSSE(const aligned_vector & ghs, const aligned_vector & lhs) { int result[4] __attribute__((aligned(16))) = {0}; __m128i vresult = _mm_set1_epi32(0); __m128i v1, v2, vmax; for…
Alexandros