Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (AVX provided 256-bit operations only for floating point).

AVX2 adds support for 256-bit integer SIMD. Most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER (which leads to SSE/AVX transition penalties) and non-obvious data movement in shuffles, because most 256-bit instructions operate on two independent 128-bit lanes.
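The lane behaviour is easiest to see with a small example. The following is a minimal sketch, not taken from any of the questions below: the function name `unpacklo_demo` is hypothetical, and the `target("avx2")` attribute assumes GCC or Clang (with other compilers, enable AVX2 through a compiler option instead).

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo: _mm256_unpacklo_epi32 interleaves the low two elements
 * of EACH 128-bit lane, not the low four elements of the whole register. */
__attribute__((target("avx2")))   /* assumption: GCC/Clang attribute */
void unpacklo_demo(int32_t out[8])
{
    const __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i b = _mm256_setr_epi32(10, 11, 12, 13, 14, 15, 16, 17);

    /* Low lane:  a0 b0 a1 b1.  High lane: a4 b4 a5 b5. */
    const __m256i r = _mm256_unpacklo_epi32(a, b);

    _mm256_storeu_si256((__m256i *)out, r);
}
```

Calling `unpacklo_demo(out)` leaves `out` holding 0, 10, 1, 11, 4, 14, 5, 15 rather than 0, 10, 1, 11, 2, 12, 3, 13, which is what a straight widening of the 128-bit SSE2 instruction might suggest.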

AVX2 also adds the following new functionality:

  • Scalar -> Vector register broadcast
  • Gather loads for loading a vector from different memory locations.
  • Masked memory loads/stores
  • New permute instructions
  • Element-wise bit-shifting that allows each element of a vector to be shifted by a different amount.
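Several of the features listed above can be tried through intrinsics. The following is a minimal sketch only: the function name `avx2_features_demo` and its parameters are hypothetical, the index and mask values are arbitrary, and the `target("avx2")` attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo of gather, register broadcast, per-element shift,
 * lane-crossing permute and masked store. */
__attribute__((target("avx2")))   /* assumption: GCC/Clang attribute */
void avx2_features_demo(const int32_t table[8],
                        int32_t gathered[8], int32_t shifted[8],
                        int32_t reversed[8], int32_t masked[8])
{
    const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    /* Gather: element i is loaded from table[idx[i]]; scale 4 = sizeof(int32_t). */
    const __m256i idx = _mm256_setr_epi32(7, 0, 5, 2, 3, 6, 1, 4);
    _mm256_storeu_si256((__m256i *)gathered,
                        _mm256_i32gather_epi32(table, idx, 4));

    /* Register broadcast, then a different shift count per element: 1 << i. */
    const __m256i ones = _mm256_broadcastd_epi32(_mm_cvtsi32_si128(1));
    _mm256_storeu_si256((__m256i *)shifted, _mm256_sllv_epi32(ones, iota));

    /* Lane-crossing permute: reverse the order of all eight elements. */
    const __m256i rev = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    _mm256_storeu_si256((__m256i *)reversed,
                        _mm256_permutevar8x32_epi32(iota, rev));

    /* Masked store: only elements whose mask has the top bit set are written. */
    const __m256i mask = _mm256_setr_epi32(-1, 0, -1, 0, -1, 0, -1, 0);
    _mm256_maskstore_epi32(masked, mask, iota);
}
```

The outputs are passed as plain arrays rather than `__m256i` values so that the caller does not itself need to be compiled with AVX2 enabled; callers should still check for AVX2 support at run time before calling.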

The AVX2 instruction set was introduced together with FMA3 (3-operand fused multiply-add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but Piledriver did not add AVX2 support.)

683 questions
0
votes
0 answers

SIMD (AVX2) mask store and pack

I am trying to perform the following operation in AVX2 code (dest, data, and mask are int32 pointers): int j=0; for(i=0; i
nineties
0
votes
1 answer

AVX, Horizontal Sum of Single Precision Complex Numbers?

I have a 256-bit AVX register containing 4 single-precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256-bit register back to memory and summing it there, but that seems inefficient. How can…
user1777820
0
votes
0 answers

Expanding uint32 to YMM register efficiently with intel intrinsics

What I am trying to implement is a way to broadcast a 32-bit integer to a 256-bit YMM register in C effectively using Intel intrinsics. The twist, however, is that I want each bit of the 32-bit integer to be translated into either a 0x00 or 0xFF byte…
oPolo
0
votes
1 answer

Multidimensional __m256i datatype alignment issues

I hope someone is able to help with this issue, which has been bothering me for over an hour now. I have this code (it is in C): #include void test_vectors(__m256i state[5][2]); void test() { __m256i state[5][2]; for (int i…
oPolo
0
votes
2 answers

Why the speedup is lower than expected by using AVX2?

I have vectorized the inner loop of matrix addition using AVX2 intrinsics, and I also have the latency table from here. I expect the speedup to be a factor of 5, because almost 4 latency happens in 1024 iterations over 6 latency…
ADMS
0
votes
1 answer

Why this code section return "Segmentation fault" error?

I'm vectorizing a part of my program, but it fails with a segmentation fault. What is wrong with this? Here is the simplified section that causes the problem. j++ and i++ are exactly what I want; I do not want j += 16. unsigned short int…
ADMS
0
votes
1 answer

AVX2 __m256i const* mem_addr in load instructions vs AVX

I cannot load or store with AVX2 intrinsics as I did with AVX before. There are no errors, just warnings, and the load/store is not performed at run time. Other AVX2 instructions work properly, but I cannot load from memory. As…
ADMS
0
votes
1 answer

MSVC 2015 AVX2 debugging problems. Not all SIMD lanes are populated correctly

I'm having trouble debugging my AVX2 code in Visual Studio 2015, update 1 (targeting Win10). When using the debugger and inspecting an AVX2 register, the contents differ when using a breakpoint and stepping over the _mm256_insertf128_ps-intrinsic…
repstosq
0
votes
0 answers

gcc optimization produce slower code

I am trying to compile the following code using gcc 4.8.2. If I compile it with g++ -mavx2 -O0 10bit.cpp, I get the following output from the time command: real 0m0.117s user 0m0.116s sys 0m0.000s but when I enable optimization g++ -mavx2 -O3…
Masoud
0
votes
1 answer

inline assembly + pointer management

I am very new to using inline assembly in C++ code. What I want to do is basically a kind of memcopy for pointer with a size modulo 32. In C++ the code used to be something like this: void my_memcpy(const std::uint8_t* in,std::uint8_t*…
John_Sharp1318
0
votes
1 answer

For some reason serial code runs faster than SIMD code

For some reason, running the simple serial code for(i=0;i<1152*1152;i++){ MatrixA3[i] = MatrixA1[i] + z*MatrixA2[i];} is faster than, or the same speed as, the vectorized equivalent: for (int i = 0; i < 1152*1152; i+=4){ load_data1 =…
0
votes
1 answer

How to examine a 256i (16-bit) vector to know if it contains any element greater than zero?

I am converting vectorized code from SSE2 intrinsics to AVX2 intrinsics, and would like to know how to check whether a 256i (16-bit) vector contains any element greater than zero. Below is the code used in the SSE2 version: int check2(__m128i vector1,…
MROF
0
votes
0 answers

`loop was not vectorized: subscript too complex` in Intel Fortran with OpenMP

I have an issue while trying to parallelize - with OpenMP - and vectorize a nested loop with ifort 14.0.2. Here's the loop: !$OMP DO schedule(auto) do ig1 = 1, N_g ic1 = (ig1-1) * N_d do ig2 = 1, N_t ig2index = T(ig2) kk = (ig2index-1) *…
bio
0
votes
1 answer

g++ -O2 incorrectly optimize out SIMD variable assignment

I'm writing a program using Intel AVX2 instructions. I found a bug in my program which appears only with optimization level -O2 or higher (with -O1 it's fine). After extensive debugging, I narrowed down the buggy region. Now the bug seems to be caused…
Neo1989
0
votes
0 answers

VEXTRACTF128 versus VEXTRACTI128

As far as I can tell, the VEXTRACTF128 and VEXTRACTI128 instructions do the same things, have the same latency, same throughput, and use the same ports. The only difference I can tell between them is that VEXTRACTF128 only requires AVX VEXTRACTI128…
Z boson