Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (AVX provided 256-bit vectors only for floating point).

AVX2 adds support for 256-bit integer SIMD: most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX.
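As a minimal sketch of what 256-bit integer SIMD looks like from C, the hypothetical helper below adds two arrays eight 32-bit elements at a time with `_mm256_add_epi32`. The function name is made up for illustration, and the `target("avx2")` attribute is a GCC/Clang feature assumed here so the file compiles without `-mavx2`; the code still needs an AVX2-capable CPU to run.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] (wrapping) for n elements.
   The target attribute (GCC/Clang) enables AVX2 for this function only. */
__attribute__((target("avx2")))
void add_u32_avx2(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    size_t i = 0;
    /* Main loop: one 256-bit vector holds eight 32-bit integers. */
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Unaligned loads and stores are used so the sketch makes no assumption about how the caller allocated the arrays.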

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER (which can cause SSE/AVX transition penalties on some CPUs) and non-obvious data movement in shuffles, due to the 128-bit lane design.
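The lane behaviour is easiest to see side by side. In this sketch (function names are hypothetical, and the `target("avx2")` attribute assumes GCC or Clang), `_mm256_shuffle_epi32` applies the same pattern inside each 128-bit lane, while `_mm256_permutevar8x32_epi32` can move elements across lanes.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* In-lane shuffle (vpshufd): the 4-element pattern is applied separately
   to the low and the high 128-bit lane, so nothing crosses the middle. */
__attribute__((target("avx2")))
void reverse_in_lane(int32_t out[8], const int32_t in[8])
{
    assert(in != NULL && out != NULL);
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i r = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Lane-crossing permute (vpermd): each output element picks any of the
   eight input elements through a vector of indices. */
__attribute__((target("avx2")))
void reverse_full(int32_t out[8], const int32_t in[8])
{
    assert(in != NULL && out != NULL);
    __m256i v   = _mm256_loadu_si256((const __m256i *)in);
    __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    _mm256_storeu_si256((__m256i *)out, _mm256_permutevar8x32_epi32(v, idx));
}
```

For the input 0..7, `reverse_in_lane` produces 3, 2, 1, 0, 7, 6, 5, 4, whereas `reverse_full` produces 7, 6, 5, 4, 3, 2, 1, 0.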

AVX2 also adds the following new functionality:

  • Scalar -> Vector register broadcast
  • Gather loads for loading a vector from different memory locations.
  • Masked memory loads/stores
  • New permute instructions
  • Element-wise bit-shifting that allows each element of a vector to be shifted by a different amount.
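Several of the features listed above can be sketched with intrinsics as follows. The helper names are hypothetical, the `target("avx2")` attribute assumes GCC or Clang, and the casts between `int32_t *` and `int *` rely on `int` being 32 bits wide, as it is on x86.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Gather load (vpgatherdd): out[i] = table[idx[i]] for eight indices.
   The scale of 4 is the element size in bytes. */
__attribute__((target("avx2")))
void gather8(int32_t out[8], const int32_t *table, const int32_t idx[8])
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256i v = _mm256_i32gather_epi32((const int *)table, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Register broadcast plus per-element shift (vpbroadcastd, vpsllvd):
   out[i] = x << counts[i]; counts of 32 or more yield 0. */
__attribute__((target("avx2")))
void shift_each(uint32_t out[8], uint32_t x, const uint32_t counts[8])
{
    __m256i vx = _mm256_broadcastd_epi32(_mm_cvtsi32_si128((int)x));
    __m256i vc = _mm256_loadu_si256((const __m256i *)counts);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(vx, vc));
}

/* Masked load/store (vpmaskmovd): copy only the first n of 8 elements.
   Lanes whose mask sign bit is clear are neither read nor written. */
__attribute__((target("avx2")))
void copy_first_n(int32_t *dst, const int32_t *src, int n)
{
    assert(n >= 0 && n <= 8);
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane); /* i < n */
    __m256i v = _mm256_maskload_epi32((const int *)src, mask);
    _mm256_maskstore_epi32((int *)dst, mask, v);
}
```

Whether a gather or a masked store is actually faster than the scalar equivalent depends heavily on the microarchitecture, so treat these as illustrations of the semantics rather than tuned code.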

The AVX2 instruction set was introduced together with FMA3 (3-operand Fused-Multiply Add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but AVX2 support was not introduced then.)

683 questions
10
votes
2 answers

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros

I want to shift SSE/AVX registers multiples of 32 bits left or right while shifting in zeros. Let me be more precise on the shifts I'm interested in. For SSE I want to do the following shifts of four 32bit floats: shift1_SSE: [1, 2, 3, 4] -> [0, 1,…
Z boson
9
votes
0 answers

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array. On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX256. When I use…
He3lixxx
9
votes
0 answers

Efficient Way of shuffling 3 bit values inside an AVX2/ymm register

I have an interesting problem that I can't think of an efficient way of solving with vectorized code. I have a ymm register with 8 32-bit integers, where each integer is made up of: Lower 24 bits are 8x3bit "individual" values Top 8 bits contain a…
damageboy
9
votes
3 answers

Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element. Is there an efficient way to implement this using AVX and…
tmlen
9
votes
2 answers

Fastest precise way to convert a vector of integers into floats between 0 and 1

Consider a randomly generated __m256i vector. Is there a faster precise way to convert them into __m256 vector of floats between 0 (inclusively) and 1 (exclusively) than division by float(1ull<<32)? Here's what I have tried so far, where iRand is…
Serge Rogatch
9
votes
1 answer

How to implement an efficient _mm256_madd_epi8 dot-products of groups of four i8 elements?

Intel provides a C style function named _mm256_madd_epi16, which basically __m256i _mm256_madd_epi16 (__m256i a, __m256i b) Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent…
Amor Fati
9
votes
2 answers

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes. I know that AVX-512 has a VPOPCNTDQ instruction which counts the…
einpoklum
9
votes
2 answers

Efficient implementation of log2(__m256d) in AVX2

SVML's __m256d _mm256_log2_pd (__m256d a) is not available on other compilers than Intel, and they say its performance is handicapped on AMD processors. There are some implementations on the internet referred in AVX log intrinsics (_mm256_log_ps)…
Serge Rogatch
9
votes
2 answers

Shuffle elements of __m256i vector

I want to shuffle elements of __m256i vector. And there is an intrinsic _mm256_shuffle_epi8 which does something like, but it doesn't perform a cross lane shuffle. How can I do it with using AVX2 instructions?
user4792273
9
votes
1 answer

Parallel programming using Haswell architecture

I want to learn about parallel programming using Intel's Haswell CPU microarchitecture. About using SIMD: SSE4.2, AVX2 in asm/C/C++/(any other langs)?. Can you recommend books, tutorials, internet resources, courses? Thanks!
Boris Ivanov
8
votes
3 answers

_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

In SSSE3, the PALIGNR instruction performs the following: PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the…
eladidan
8
votes
5 answers

Fast modulo-12 algorithm for 4 uint16_t's packed in a uint64_t

Consider the following union: union Uint16Vect { uint16_t _comps[4]; uint64_t _all; }; Is there a fast algorithm for determining whether each component equals 1 modulo 12 or not? A naive sequence of code is: Uint16Vect F(const Uint16Vect a)…
Serge Rogatch
8
votes
3 answers

How to count character occurrences using SIMD

I am given an array of lowercase characters (up to 1.5Gb) and a character c. And I want to find how many occurrences are of the character c using AVX instructions. unsigned long long char_count_AVX2(char * vector, int size, char c){ unsigned…
Adamos2468
8
votes
2 answers

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

What I want to do is: Multiply the input floating point number by a fixed factor. Convert them to 8-bit signed char. Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127,…
Amor Fati
8
votes
1 answer

perf report shows this function "__memset_avx2_unaligned_erms" has overhead. does this mean memory is unaligned?

I am trying to profile my C++ code using perf tool. Implementation contains code with SSE/AVX/AVX2 instructions. In addition to that code is compiled with -O3 -mavx2 -march=native flags. I believe __memset_avx2_unaligned_erms function is a libc…
yadhu