Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (AVX provided 256-bit operations only for floating point).

AVX2 adds support for 256-bit integer SIMD: most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 instructions use the same VEX encoding scheme as AVX instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

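As with AVX, common problems are a missing VZEROUPPER (which can cause costly transitions when 256-bit VEX code is mixed with legacy SSE code) and non-obvious data movement in shuffles caused by the 128-bit lane design: most 256-bit shuffles operate on each 128-bit lane independently.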
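
The lane behaviour is easiest to see in a scalar model. The sketch below is not AVX2 code; it is a plain-C approximation of how the 256-bit VPSHUFB (intrinsic `_mm256_shuffle_epi8`) selects bytes. The helper name `model_vpshufb256` is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model (illustrative name) of 256-bit VPSHUFB / _mm256_shuffle_epi8.
   The 32 bytes are treated as two independent 16-byte lanes. */
void model_vpshufb256(const uint8_t src[32], const uint8_t ctrl[32],
                      uint8_t dst[32])
{
    uint8_t tmp[32];   /* temporary buffer so dst may alias src or ctrl */

    assert(src != NULL && ctrl != NULL && dst != NULL);

    for (int lane = 0; lane < 2; lane++) {
        const uint8_t *lane_src = src + 16 * lane;
        for (int i = 0; i < 16; i++) {
            uint8_t c = ctrl[16 * lane + i];
            /* Bit 7 set: the result byte is zero. Otherwise the low 4 bits
               select a byte from the SAME lane; the other lane is out of reach. */
            tmp[16 * lane + i] = (c & 0x80) ? 0 : lane_src[c & 0x0F];
        }
    }
    memcpy(dst, tmp, sizeof tmp);
}
```

With `src[i] = i` and every control byte set to 0, the model fills the low lane with byte 0 and the high lane with byte 16, rather than broadcasting byte 0 to all 32 positions. Moving data between lanes needs a lane-crossing instruction such as VPERMD or VPERM2I128.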

AVX2 also adds the following new functionality:

  • Scalar -> vector register broadcast
  • Gather loads, which fill a vector from different memory locations
  • Masked memory loads/stores for integer elements
  • New lane-crossing permute instructions
  • Element-wise bit shifts, which allow each element of a vector to be shifted by a different amount
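
As a rough illustration of two items in the list above, here is a scalar sketch of per-element shifts (VPSLLVD, intrinsic `_mm256_sllv_epi32`) and of an unmasked gather load (VPGATHERDD, intrinsic `_mm256_i32gather_epi32` with a scale of 4). The function names `model_sllv_epi32` and `model_i32gather_epi32` are hypothetical; the code only models the results and is not vectorised itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (illustrative name) of VPSLLVD / _mm256_sllv_epi32:
   each 32-bit element is shifted left by its own count. */
void model_sllv_epi32(const uint32_t a[8], const uint32_t count[8],
                      uint32_t dst[8])
{
    assert(a != NULL && count != NULL && dst != NULL);
    for (int i = 0; i < 8; i++) {
        /* The instruction yields 0 for counts above 31. A plain C shift by
           32 or more is undefined, so that case is handled explicitly. */
        dst[i] = (count[i] < 32) ? (a[i] << count[i]) : 0;
    }
}

/* Scalar model (illustrative name) of an unmasked VPGATHERDD with scale 4:
   element i is loaded from base[index[i]]. Indices are signed. */
void model_i32gather_epi32(const int32_t *base, const int32_t index[8],
                           int32_t dst[8])
{
    assert(base != NULL && index != NULL && dst != NULL);
    for (int i = 0; i < 8; i++) {
        dst[i] = base[index[i]];
    }
}
```

In the shift model, an input of all ones with counts `{0, 1, 2, 3, 4, 31, 32, 100}` gives `{1, 2, 4, 8, 16, 0x80000000, 0, 0}`. The explicit range check is the design point: the hardware defines the out-of-range case, whereas the C `<<` operator does not.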

The AVX2 instruction set was introduced together with FMA3 (3-operand fused multiply-add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but AVX2 support was not introduced at that point; AMD added it later, starting with Excavator.)

683 questions
0 votes, 0 answers

How to use shuffle control mask

I think the vpshufb instruction would work well for something that I'm trying to do, but I don't know how to use the shuffle control mask to control where parts of the vector are shuffled, and I cannot find information on how to do this on the…
0 votes, 0 answers

Anaconda Tensorflow Compiler Issue CPU AVX AVX2

I installed Tensorflow via Anaconda and tried testing whether it works using the short program on the website, but I ended up with this error. Is there something wrong, or can my CPU not handle it? Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018,…
0 votes, 0 answers

Is there a penalty for mixing x86-64 integer instructions with AVX1/2/512 instructions?

I have seen a lot of assembly with AVX (all three flavors), and in all the cases I have seen, the more the code concentrates on one kind of instruction, the better it performs. But, for example, things like doing a load into a 32-bit register and then…
JLV
0 votes, 0 answers

When can I call xsaves and xsaves64?

When is it allowed to call xsaves and xsaves64? Using Intel Software Development Emulator (8.12.0-2017-10-23), I can use xsaves64 + xrstors64 from user-space without any problems, but trying to use xsaves + xrstors produces: Illegal instruction at…
gnzlbg
0 votes, 0 answers

C AVX2 sum array horizontally

I have some problems with AVX2 instructions. I wrote a program in C which reads a binary file of unsigned chars and then sums them. Now I want to replace the C for loop with AVX2 instructions, but it doesn't work. This is the first time I have used AVX2.…
AsdFork
0 votes, 3 answers

AVX2 1GB long array

I have a 1 GB array of floats in a .bin file. After I read it, how can I sum the elements with AVX2 instructions and print the result? I edited my code with Jake 'Alquimista' LEE's answer. The problem is that the result is much smaller than it should be.…
RafaNadal95
0 votes, 0 answers

AVX Command Error for integer addition

Does anyone know how to resolve this type of error? I am trying to add two 256-bit integer vectors, but I get the following error: cpu_avx.c:12:20: error: incompatible types when initializing type ‘__m256i’ using type ‘int’ __m256i result =…
Sagar
0 votes, 1 answer

SIMD -> uint16_t array to float array work on float then back to uint16_t

I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions. The code using for loops is int idx; uint16_t* A, B, C; float…
0 votes, 1 answer

How to detect a Xeon Phi (Knights Landing)

Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid a costly transition to the non-VEX state on all processors, including future Xeon processors, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023 People have also measured…
Maxim Masiutin
0 votes, 0 answers

Why this vectorization fails on AVX-512 and not on AVX2?

I have this code which I test on my AVX2 machine: bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res) { bool ret = false; // input size (-1 for the safe bilinear…
justHelloWorld
0 votes, 1 answer

256 bit CRC calculation on AVX2

A 64-bit CRC function exists in the Intel SSE4.2 intrinsics: unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v). However, I can't find a 256-bit version of the CRC calculation among the AVX2 intrinsics. I'm using 256-bit variables (__m256i) on…
0 votes, 1 answer

What is the avx2 instruction to store 8 integers?

I want to store the 8 integers from a __m256i variable to an array of eight 32-bit ints. I thought the instruction for that would be _mm256_store_epi32, but I get an error that this instruction doesn't even exist!
pythonic
0 votes, 1 answer

AVX2 SIMD addition not working

I am trying to add these two vectors using AVX2 SIMD instructions. The code compiles with no errors or warnings, but crashes when run. Why? It should print the result of the SIMD addition with AVX2 no matter how large the array is which is initialized in…
K.Malu
0 votes, 2 answers

Do the bitwise operations (&, ^, | etc.) provided as operator overloads in std::bitset use AVX or SSE4 instructions?

Since this is implementation-dependent, is the only way to find out through the disassembly?
0 votes, 3 answers

Converting from Source-based Indices to Destination-based Indices

I'm using AVX2 instructions in some C code. The VPERMD instruction takes two 8-integer vectors a and idx and generates a third one, dst, by permuting a based on idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source…
eyepatch
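
The two index conventions named in the last question's title can be sketched in scalar C. This is only a model under stated assumptions, since the excerpt is truncated: it assumes the index vector is a permutation of 0..7, in which case source-based indices (`dst[i] = a[idx[i]]`, the VPERMD / `_mm256_permutevar8x32_epi32` convention) and destination-based indices (`dst[idx[i]] = a[i]`) are inverse permutations of each other. The names `model_vpermd` and `invert_permutation8` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model (illustrative name) of VPERMD: dst[i] = a[idx[i]].
   Only the low 3 bits of each index are used. */
void model_vpermd(const uint32_t a[8], const uint32_t idx[8], uint32_t dst[8])
{
    uint32_t tmp[8];   /* temporary buffer so dst may alias a or idx */
    for (int i = 0; i < 8; i++) {
        tmp[i] = a[idx[i] & 7];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Invert a permutation of 0..7: out[in[i]] = i.
   Converts between the two conventions in either direction.
   ASSUMPTION: in[] is a true permutation; in and out must not alias. */
void invert_permutation8(const uint32_t in[8], uint32_t out[8])
{
    unsigned seen = 0;   /* bit k set once index k has appeared */
    for (uint32_t i = 0; i < 8; i++) {
        assert(in[i] < 8);                     /* index in range */
        assert((seen & (1u << in[i])) == 0);   /* no duplicates */
        seen |= 1u << in[i];
        out[in[i]] = i;
    }
}
```

Given destination-based indices `d`, calling `invert_permutation8(d, s)` and then `model_vpermd(a, s, out)` yields `out[d[i]] == a[i]` for every `i`. If the indices contain duplicates, no inverse exists and a different approach is needed.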