Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of the integer instructions (AVX itself only provided 256-bit floating-point instructions).

AVX2 adds support for 256-bit integer SIMD: most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER (which causes SSE/AVX transition penalties) and non-obvious data movement in shuffles, due to the design's two 128-bit lanes.
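As a concrete example of the lane design (a minimal sketch of our own, not tied to any particular question below): _mm256_unpacklo_epi32 interleaves within each 128-bit lane, not across the whole register.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m256i a  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i b  = _mm256_setr_epi32(10, 11, 12, 13, 14, 15, 16, 17);
        __m256i lo = _mm256_unpacklo_epi32(a, b);   // interleaves per lane
        int out[8];
        _mm256_storeu_si256((__m256i *)out, lo);
        for (int i = 0; i < 8; ++i) printf("%d ", out[i]);
        // prints "0 10 1 11 4 14 5 15", not the full-width "0 10 1 11 2 12 3 13"
        return 0;
    }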

AVX2 also adds the following new functionality, each of which appears in the sketch after this list:

  • Scalar-to-vector register broadcasts
  • Gather loads, for loading a vector from non-contiguous memory locations
  • Masked memory loads/stores
  • New permute instructions, including lane-crossing permutes of 32-bit elements
  • Per-element variable shifts, where each element of a vector can be shifted by a different amount
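The sketch below touches all five. The function, its parameters, and the index data are illustrative only; in particular, every indices[i] must be a valid offset into data.

    #include <immintrin.h>

    // illustrative tour of the new AVX2 functionality listed above;
    // out must have room for 16 ints, indices[0..7] must be in range
    void avx2_feature_tour(const int *data, const int *indices, int *out) {
        __m256i bcast = _mm256_set1_epi32(data[0]);           // vpbroadcastd
        __m256i idx   = _mm256_loadu_si256((const __m256i *)indices);
        __m256i g     = _mm256_i32gather_epi32(data, idx, 4); // vpgatherdd
        __m256i mask  = _mm256_cmpgt_epi32(g, bcast);
        _mm256_maskstore_epi32(out, mask, g);                 // vpmaskmovd
        __m256i rev   = _mm256_permutevar8x32_epi32(          // vpermd
            g, _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0));
        __m256i sh    = _mm256_sllv_epi32(                    // vpsllvd
            rev, _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
        _mm256_storeu_si256((__m256i *)(out + 8), sh);
    }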

The AVX2 instruction set was introduced together with FMA3 (3-operand fused multiply-add) in 2013 with Intel's Haswell processor line. (AMD CPUs have supported FMA3 since Piledriver, but AVX2 did not arrive on AMD until the Excavator core.)

683 questions
12 votes, 1 answer

Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge?

In the Intel intrinsics webapp, several operations seem to have worsened from Sandy Bridge to Haswell. For example, many insert operations like _mm256_insertf128_si256 show a cost table like the following: Performance Architecture Latency …
orm
12 votes, 3 answers

Emulating shifts on 32 bytes with AVX

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discover that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two halves of the AVX registers separately…
user1196549
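A well-known emulation for a whole-register byte shift builds the bytes that cross the lane boundary with vperm2i128 and stitches them in with vpalignr. A sketch for the left-shift case with a compile-time count N in (0, 16); the macro name is ours:

    #include <immintrin.h>

    // shift the full 256-bit value left by N bytes, for 0 < N < 16;
    // the vperm2i128 yields { low128 = 0, high128 = a_low }, so vpalignr
    // can pull the bytes that cross the lane boundary into the upper lane
    #define MM256_SLLI_SI256(a, N)                                  \
        _mm256_alignr_epi8((a),                                     \
            _mm256_permute2x128_si256((a), (a), 0x08), 16 - (N))

The right-shift case is the mirror image, selecting a's high lane into the low half of the vperm2i128 result instead.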
11 votes, 0 answers

What's the difference between the XOR instructions "VPXORD", "VXORPS" and "VXORPD" in Intel's AVX2

I see that in the AVX2 instruction set, Intel distinguishes the XOR operations on integer, double, and float data with different instructions. For integer there's "VPXORD", for double "VXORPD", and for float "VXORPS". However, per my understanding, they should all…
Harper
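The three flavors as intrinsics, for reference (strictly speaking, VPXORD is the AVX-512 EVEX form; AVX2 itself has VPXOR). All three produce identical bit patterns and differ only in execution domain, integer versus floating point, which can add a cycle of bypass latency on some microarchitectures when a result crosses domains:

    #include <immintrin.h>

    __m256i xor_int(__m256i a, __m256i b) { return _mm256_xor_si256(a, b); } // vpxor
    __m256  xor_ps (__m256 a,  __m256 b)  { return _mm256_xor_ps(a, b);    } // vxorps
    __m256d xor_pd (__m256d a, __m256d b) { return _mm256_xor_pd(a, b);    } // vxorpd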
11 votes, 2 answers

What do you do without fast gather and scatter in AVX2 instructions?

I'm writing a program to detect prime numbers. One part is bit sieving out possible candidates. I've written a fairly fast program, but I thought I'd see if anyone has some better ideas. My program could use some fast gather and scatter…
ChipK
11 votes, 1 answer

Efficient way to set first N or last N bits of __m256i to 1, the rest to 0

How do I efficiently set the first N bits or the last N bits of an __m256i to 1 with AVX2, setting the rest to 0? These are 2 separate operations, for the tail and head of a bit range, when the range may start and end in the middle of an __m256i value. The part of the…
Serge Rogatch
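One possible AVX2 approach for the "first N bits" half (the helper name is ours): compute a per-element shift count and exploit the fact that vpsllvd yields 0 for counts of 32 or more.

    #include <immintrin.h>

    // mask with the first n bits set, for n in [0, 256]
    static __m256i first_n_bits(unsigned n) {
        const __m256i base = _mm256_setr_epi32(0, 32, 64, 96, 128, 160, 192, 224);
        __m256i c = _mm256_sub_epi32(_mm256_set1_epi32((int)n), base);
        c = _mm256_max_epi32(c, _mm256_setzero_si256()); // clamp negatives to 0
        __m256i ones = _mm256_set1_epi32(-1);
        // vpsllvd returns 0 for counts >= 32, so fully covered elements stay ~0
        __m256i high = _mm256_sllv_epi32(ones, c);       // zeros in the low c bits
        return _mm256_andnot_si256(high, ones);          // ~high: low c bits set
    }

The "last N bits" mask is the mirror image, using _mm256_srlv_epi32 and counting elements from the top.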
11 votes, 1 answer

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector where, for every element in a, a bit is set if it has the same value as another element. Is there a way to do something similar in AVX2? I'm not interested in the exact bits, I just…
Christoph Diegelmann
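One way to approximate it with AVX2 (the helper name is ours): compare the vector against all seven of its rotations through vpermd and OR the results. Note the result is symmetric, flagging both members of a duplicate pair, unlike vpconflictd, which only reports conflicts with earlier elements.

    #include <immintrin.h>

    // all-ones in every 32-bit lane that collides with some other lane
    static __m256i any_conflict_epi32(__m256i v) {
        const __m256i rot1 = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);
        __m256i idx  = rot1;
        __m256i hits = _mm256_setzero_si256();
        for (int i = 0; i < 7; ++i) {
            __m256i r = _mm256_permutevar8x32_epi32(v, idx); // rotate by i+1
            hits = _mm256_or_si256(hits, _mm256_cmpeq_epi32(v, r));
            idx  = _mm256_permutevar8x32_epi32(idx, rot1);   // next rotation
        }
        return hits;
    }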
11 votes, 3 answers

Packing and de-interleaving two __m256 registers

I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers. ...a0.........b0...... ...a1.........b1...... // ... ...a7.........b7...... // end first __m256 A naive way to do…
ZachB
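One direct AVX2 option, though not necessarily the fastest on Haswell, where gathers are slow, is a strided gather per column; the helper name, base pointer, and stride parameter here are illustrative:

    #include <immintrin.h>

    // load 8 consecutive values of one column from a row-major float matrix
    static __m256 load_column(const float *col_start, int floats_per_row) {
        __m256i idx = _mm256_mullo_epi32(
            _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
            _mm256_set1_epi32(floats_per_row));
        return _mm256_i32gather_ps(col_start, idx, 4); // scale = sizeof(float)
    }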
11 votes, 1 answer

Where is VPERMB in AVX2?

AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to totally arbitrarily broadcast/shuffle/permute from one 256-bit long vector of…
BeeOnRope
11 votes, 1 answer

Is this incorrect code generation with arrays of __m256 values a clang bug?

I'm encountering what appears to be a bug causing incorrect code generation with clang 3.4, 3.5, and 3.6 trunk. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to this self-contained…
Jason R
11 votes, 1 answer

Fastest way to unpack 32 bits to a 32 byte SIMD vector

Having 32 bits stored in a uint32_t in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte. Edit: to clarify, I mean bit 0 goes to byte 0, bit…
alecco
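A common pattern for this (the helper name is ours): broadcast the 32-bit word, route each source byte to the eight destination bytes that test its bits with vpshufb, then isolate one distinct bit per byte.

    #include <immintrin.h>
    #include <stdint.h>

    // expand bit i of x into byte i of the result (0x00 or 0xFF)
    static __m256i bits_to_bytes(uint32_t x) {
        __m256i v = _mm256_set1_epi32((int)x);
        // vpshufb works per 128-bit lane; the broadcast puts all four source
        // bytes in both lanes, so in-lane indices 0..3 are all we need
        const __m256i shuf = _mm256_setr_epi8(
            0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
            2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3);
        v = _mm256_shuffle_epi8(v, shuf);
        const __m256i bit = _mm256_set1_epi64x((long long)0x8040201008040201ULL);
        v = _mm256_and_si256(v, bit);      // keep one distinct bit per byte
        return _mm256_cmpeq_epi8(v, bit);  // 0xFF where that bit was set
    }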
11 votes, 1 answer

How can I add together two SSE registers

I have two SSE registers (one register is 128 bits) and I want to add them together. I know how I can add corresponding words in them, for example I can do it with _mm_add_epi16 if I use 16-bit words in the registers, but what I want is something like…
Martinsos
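If the goal is to treat each XMM register as one 128-bit unsigned integer, here is a sketch of the carry propagation (the helper name is ours; _mm_cmpgt_epi64 requires SSE4.2, and on plain SSE2 the carry is easier to handle in GPRs):

    #include <immintrin.h>

    // add two XMM registers as single 128-bit unsigned integers
    static __m128i add_u128(__m128i a, __m128i b) {
        __m128i sum = _mm_add_epi64(a, b);
        const __m128i top = _mm_set1_epi64x((long long)0x8000000000000000ULL);
        // unsigned "sum < b" per 64-bit lane via the signed-compare bias trick;
        // a lane overflowed exactly when its sum is below b
        __m128i carry = _mm_cmpgt_epi64(_mm_xor_si128(b, top),
                                        _mm_xor_si128(sum, top));
        // move the low lane's carry (as the value 1) into the high lane
        carry = _mm_slli_si128(_mm_and_si128(carry, _mm_set1_epi64x(1)), 8);
        return _mm_add_epi64(sum, carry);
    }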
10 votes, 1 answer

What is packed and unpacked and extended packed data

I have been going through the Intel intrinsics, and every function works on integers, floats, or doubles that are packed, unpacked, or extended packed. It seems like this question should be answered somewhere on the internet but I can't find the…
Omar Khalid
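As a quick illustration of the naming convention ("packed" forms touch every element, scalar forms only the lowest one; the wrapper names are ours):

    #include <immintrin.h>

    __m128 add_packed(__m128 a, __m128 b) { return _mm_add_ps(a, b); } // all 4 floats
    __m128 add_scalar(__m128 a, __m128 b) { return _mm_add_ss(a, b); } // low float only,
                                                                       // upper 3 copied from a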
10 votes, 1 answer

Best way to load/store from/to general purpose registers to/from xmm/ymm register

What is the best way to load and store general purpose registers to/from SIMD registers? So far I have been using the stack as a temporary. For example, mov [rsp + 0x00], r8 mov [rsp + 0x08], r9 mov [rsp + 0x10], r10 mov [rsp + 0x18], r11 vmovdqa ymm0,…
Yan Zhou
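A stack-free alternative, written as intrinsics (the helper name is ours; compilers typically lower it to vmovq / vpinsrq / vinserti128):

    #include <immintrin.h>
    #include <stdint.h>

    // build a __m256i from four 64-bit GPR values without a memory round trip
    static __m256i gpr_to_ymm(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
        __m128i lo = _mm_cvtsi64_si128((long long)a);   // vmovq
        lo = _mm_insert_epi64(lo, (long long)b, 1);     // vpinsrq
        __m128i hi = _mm_cvtsi64_si128((long long)c);
        hi = _mm_insert_epi64(hi, (long long)d, 1);
        return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
    }

Which route wins depends on the surrounding code: the ALU route costs shuffle-port uops, while the store/reload route pays a store-forwarding stall on the wide reload.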
10 votes, 0 answers

Is there, or will there be, a "global" version of the target_clones attribute?

I've recently played around with the target_clones attribute, available from gcc 6.1 onward. It's quite nifty, but, for now, it requires a somewhat clumsy approach; every function that one wants multi-versioned has to have an attribute declared…
bolind
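For reference, the per-function form looks like this (GCC 6.1+; the function body is illustrative):

    // GCC emits one clone per listed target, plus an ifunc resolver
    // that picks the right one at load time based on the running CPU
    __attribute__((target_clones("avx2", "sse4.1", "default")))
    int dot(const int *a, const int *b, int n) {
        int s = 0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }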
10 votes, 3 answers

AVX2 slower than SSE on Haswell

I have the following code (normal, SSE and AVX): int testSSE(const aligned_vector & ghs, const aligned_vector & lhs) { int result[4] __attribute__((aligned(16))) = {0}; __m128i vresult = _mm_set1_epi32(0); __m128i v1, v2, vmax; for…
Alexandros