Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (where AVX only provided 256-bit floating point).

AVX2 adds support for 256-bit integer SIMD. Most existing 128-bit SSE integer instructions are extended to 256-bit. AVX2 uses the same VEX encoding scheme as AVX instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

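As with AVX, common problems are a missing VZEROUPPER, which causes penalties when 256-bit code is mixed with legacy SSE code, and non-obvious data movement in shuffles, because most 256-bit shuffles operate within each 128-bit lane separately.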
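The lane behaviour is easiest to see on a small case. The following is a minimal sketch, not a reference implementation: the function names (`has_avx2`, `reverse_in_lane`, `reverse_full`) are made up for illustration, and it assumes GCC or Clang, since it relies on `__attribute__((target("avx2")))` and `__builtin_cpu_supports` so that it builds without `-mavx2`. It tries to reverse eight 32-bit elements, first with an in-lane shuffle and then with a lane-crossing permute.

```c
#include <immintrin.h>
#include <stdint.h>

/* Runtime check (GCC/Clang builtin) so callers can skip the AVX2 paths
   on CPUs that do not support them. Hypothetical helper name. */
int has_avx2(void) {
    return __builtin_cpu_supports("avx2") != 0;
}

/* In-lane shuffle: VPSHUFD applies the same 4-element pattern to each
   128-bit lane independently, so elements never move between lanes.
   Input [0..7] becomes [3,2,1,0, 7,6,5,4], not a full reversal. */
__attribute__((target("avx2")))
void reverse_in_lane(const int32_t in[8], int32_t out[8]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i r = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Lane-crossing permute: VPERMD picks each output element from any of
   the eight inputs, so [0..7] becomes [7,6,5,4,3,2,1,0]. */
__attribute__((target("avx2")))
void reverse_full(const int32_t in[8], int32_t out[8]) {
    __m256i v   = _mm256_loadu_si256((const __m256i *)in);
    __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i r   = _mm256_permutevar8x32_epi32(v, idx);
    _mm256_storeu_si256((__m256i *)out, r);
}
```

Callers would check `has_avx2()` before calling either function. As for VZEROUPPER, current GCC and Clang normally insert it automatically at the end of functions that use 256-bit registers, so the problem mostly shows up in hand-written assembly.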

AVX2 also adds the following new functionality:

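Scalar -> vector register broadcast (VPBROADCASTB/W/D/Q, plus register-source forms of VBROADCASTSS/SD)
Gather loads for loading a vector from different memory locations (for example VPGATHERDD, VGATHERDPS)
Masked memory loads/stores for 32-bit and 64-bit integer elements (VPMASKMOVD/Q)
New permute instructions, including lane-crossing ones (VPERMD, VPERMPS, VPERMQ, VPERMPD)
Element-wise bit-shifting that allows each element of a vector to be shifted by a different amount (VPSLLVD/Q, VPSRLVD/Q, VPSRAVD)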
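Two of the features listed above, per-element variable shifts and gather loads, can be sketched with intrinsics as follows. This is only an illustration under stated assumptions: the function names `shift_each` and `gather8` are hypothetical, and the `target("avx2")` attribute assumes GCC or Clang (with other toolchains the usual approach is to enable AVX2 for the whole file instead).

```c
#include <immintrin.h>
#include <stdint.h>

/* Per-element variable shift (VPSLLVD): out[i] = values[i] << counts[i].
   Counts above 31 produce 0 for that element. */
__attribute__((target("avx2")))
void shift_each(const int32_t values[8], const int32_t counts[8],
                int32_t out[8]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)values);
    __m256i c = _mm256_loadu_si256((const __m256i *)counts);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(v, c));
}

/* Gather load (VPGATHERDD): out[i] = table[idx[i]].
   The scale argument of 4 is the element size in bytes. The caller must
   make sure every index is within the bounds of table. */
__attribute__((target("avx2")))
void gather8(const int32_t *table, const int32_t idx[8], int32_t out[8]) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256i r = _mm256_i32gather_epi32((const int *)table, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, r);
}
```

Both functions should only be called after confirming AVX2 support at run time, for example with `__builtin_cpu_supports("avx2")` on GCC/Clang.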

The AVX2 instruction set was introduced together with FMA3 (3-operand Fused Multiply-Add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but AMD did not add AVX2 until the later Excavator microarchitecture.)

683 questions

votes

2 answers

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros

I want to shift SSE/AVX registers by multiples of 32 bits left or right while shifting in zeros. Let me be more precise on the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats: shift1_SSE: [1, 2, 3, 4] -> [0, 1,…

asked Oct 22 '13 at 11:27

Z boson


votes

0 answers

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array. On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX256. When I use…

c++ clang compiler-optimization avx2 avx512

asked Jan 06 '23 at 12:21

He3lixxx


votes

0 answers

Efficient Way of shuffling 3 bit values inside an AVX2/ymm register

I have an interesting problem that I can't think of an efficient way of solving with vectorized code. I have a ymm register with 8 32-bit integers, where each integer is made up of: the lower 24 bits are 8x3-bit "individual" values; the top 8 bits contain a…

c sse simd avx avx2

asked Dec 01 '19 at 08:42

damageboy


votes

3 answers

Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element. Is there an efficient way to implement this using AVX and…

bit-manipulation simd avx avx2 avx512

asked Nov 12 '19 at 16:46

tmlen


votes

2 answers

Fastest precise way to convert a vector of integers into floats between 0 and 1

Consider a randomly generated __m256i vector. Is there a faster precise way to convert them into __m256 vector of floats between 0 (inclusively) and 1 (exclusively) than division by float(1ull<<32)? Here's what I have tried so far, where iRand is…

c random vectorization simd avx2

asked Feb 25 '19 at 15:34

Serge Rogatch


votes

1 answer

How to implement an efficient _mm256_madd_epi8 dot-products of groups of four i8 elements?

Intel provides a C style function named _mm256_madd_epi16, which basically __m256i _mm256_madd_epi16 (__m256i a, __m256i b) Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent…

c++ x86 simd intrinsics avx2

asked Jul 17 '18 at 13:11

Amor Fati

votes

2 answers

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes. I know that AVX-512 has a VPOPCNTDQ instruction which counts the…

assembly avx2 avx512 bitcount population-count

asked Apr 28 '18 at 22:04

einpoklum


votes

2 answers

Efficient implementation of log2(__m256d) in AVX2

SVML's __m256d _mm256_log2_pd (__m256d a) is not available on other compilers than Intel, and they say its performance is handicapped on AMD processors. There are some implementations on the internet referred in AVX log intrinsics (_mm256_log_ps)…

c++ algorithm floating-point logarithm avx2

asked Aug 19 '17 at 09:50

Serge Rogatch


votes

2 answers

Shuffle elements of __m256i vector

I want to shuffle the elements of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, which does something like that, but it doesn't perform a cross-lane shuffle. How can I do it using AVX2 instructions?

c++ simd avx2

asked Jun 05 '15 at 14:50

user4792273

votes

1 answer

Parallel programming using Haswell architecture

I want to learn about parallel programming using Intel's Haswell CPU microarchitecture, in particular about using SIMD (SSE4.2, AVX2) in asm/C/C++ (or any other languages). Can you recommend books, tutorials, internet resources, courses? Thanks!

sse cpu-architecture avx avx2

asked Jan 05 '14 at 12:47

Boris Ivanov


votes

3 answers

_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

In SSSE3, the PALIGNR instruction performs the following: PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the…

x86 simd intrinsics avx avx2

asked Dec 15 '11 at 09:39

eladidan


votes

5 answers

Fast modulo-12 algorithm for 4 uint16_t's packed in a uint64_t

Consider the following union: union Uint16Vect { uint16_t _comps[4]; uint64_t _all; }; Is there a fast algorithm for determining whether each component equals 1 modulo 12 or not? A naive sequence of code is: Uint16Vect F(const Uint16Vect a)…

c algorithm vectorization modulo avx2

asked Feb 16 '19 at 17:41

Serge Rogatch


votes

3 answers

How to count character occurrences using SIMD

I am given an array of lowercase characters (up to 1.5Gb) and a character c, and I want to count the occurrences of the character c using AVX instructions. unsigned long long char_count_AVX2(char * vector, int size, char c){ unsigned…

c simd avx avx2

asked Feb 05 '19 at 18:47

Adamos2468

votes

2 answers

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

What I want to do is: Multiply the input floating point number by a fixed factor. Convert them to 8-bit signed char. Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127,…

c x86 simd intrinsics avx2

asked Aug 10 '18 at 03:54

Amor Fati

votes

1 answer

perf report shows this function "__memset_avx2_unaligned_erms" has overhead. Does this mean memory is unaligned?

I am trying to profile my C++ code using the perf tool. The implementation contains code with SSE/AVX/AVX2 instructions. In addition, the code is compiled with the -O3 -mavx2 -march=native flags. I believe __memset_avx2_unaligned_erms function is a libc…

c++ profiling avx perf avx2

asked Jul 31 '18 at 13:27

yadhu

