Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (AVX provided 256-bit operations only for floating point).

AVX2 adds support for 256-bit integer SIMD. Most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER and non-obvious data movement in shuffles, caused by the design of a 256-bit register as two 128-bit lanes.
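The lane behaviour is easier to see in a scalar model. The following is a minimal sketch, not production code: `model_shuffle_epi8_256` and `model_reverse_attempt` are hypothetical names, and the model follows the documented semantics of the 256-bit VPSHUFB (`_mm256_shuffle_epi8`), where each control byte selects only from the 16 source bytes of its own lane. It shows why a control vector that looks like a full 32-byte reversal ends up reversing each 16-byte half separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference model of the 256-bit VPSHUFB (_mm256_shuffle_epi8).
   Hypothetical helper for illustration; dst must not alias src. */
void model_shuffle_epi8_256(const uint8_t src[32],
                            const uint8_t ctrl[32],
                            uint8_t dst[32])
{
    assert(dst != src);
    for (size_t i = 0; i < 32; ++i) {
        size_t lane_base = i & 16;   /* 0 for the low lane, 16 for the high lane */
        if (ctrl[i] & 0x80)
            dst[i] = 0;              /* high bit of the control byte zeroes the result */
        else
            dst[i] = src[lane_base + (ctrl[i] & 0x0F)]; /* only 4 index bits are used */
    }
}

/* Naive attempt at reversing 32 bytes with control byte i = 31 - i.
   Because the index is taken modulo 16 within each lane, the result
   is a reversal of each 16-byte half, not of the whole vector. */
void model_reverse_attempt(const uint8_t src[32], uint8_t dst[32])
{
    uint8_t ctrl[32];
    for (size_t i = 0; i < 32; ++i)
        ctrl[i] = (uint8_t)(31 - i);
    model_shuffle_epi8_256(src, ctrl, dst);
}
```

With real intrinsics, a full reversal therefore needs an extra lane-crossing step after the in-lane shuffle, for example a swap of the two halves with `_mm256_permute2x128_si256`.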

AVX2 also adds the following new functionality:

  • Scalar-to-vector register broadcast
  • Gather loads, which fill a vector from non-contiguous memory locations
  • Masked memory loads/stores
  • New permute instructions
  • Element-wise bit shifts, where each element of a vector can be shifted by a different amount
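As an illustration of the last item, here is a minimal sketch of a per-element shift using `_mm256_sllv_epi32` (VPSLLVD). The function name `shift_left_each` is hypothetical. The intrinsic path is compiled only when the compiler defines `__AVX2__` (for example with `-mavx2` on GCC or Clang); otherwise a scalar loop with the same results is used, so the sketch builds either way.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Shift each of eight 32-bit elements left by its own count.
   Counts of 32 or more give 0, which matches VPSLLVD. */
void shift_left_each(const uint32_t in[8],
                     const uint32_t counts[8],
                     uint32_t out[8])
{
#if defined(__AVX2__)
    /* Unaligned loads and stores, so no 32-byte alignment is required. */
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i c = _mm256_loadu_si256((const __m256i *)counts);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(v, c));
#else
    /* Scalar fallback: guard the count, since shifting a 32-bit value
       by 32 or more is undefined behaviour in C. */
    for (int i = 0; i < 8; ++i)
        out[i] = counts[i] < 32 ? in[i] << counts[i] : 0;
#endif
}
```

Before AVX2, the SSE and AVX shift instructions applied a single count to every element, so per-element counts had to be emulated, for instance with multiplies.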

The AVX2 instruction set was introduced together with FMA3 (three-operand fused multiply-add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but AVX2 support was not introduced at that point.)

683 questions
0
votes
1 answer

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

I'm trying to pack 16-bit data to 8 bits by using _mm256_shuffle_epi8, but the result I get is not what I'm expecting. auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
Sly14
  • 3
  • 2
0
votes
1 answer

__AVX2__ appears not to be defined in eclipse-cdt

My compiler supports AVX2, and I added -mavx2 to the C++ flags, but the __AVX2__ macro is not defined in my code. #ifdef __AVX2__ #include #endif appears to be disabled in the code. Edit: My compiler version is: g++ (Ubuntu…
Jacko
  • 12,665
  • 18
  • 75
  • 126
0
votes
1 answer

Setting proper alignment of packed long long on GCC to use with avx2 instructions

Introduction: I'm writing a function to process 4 packed long long int in x86_64 assembly using AVX2 instructions. Here is what my header file looks like: avx2.h #define AVX2_ALIGNMENT 32 // Processes 4 packed long long int and // returns a pointer…
St.Antario
  • 26,175
  • 41
  • 130
  • 318
0
votes
1 answer

String reverse with x64 SSE / AVX registers

I'm trying to write SIMD assembly instructions to reverse a string between 16 and 32 bytes of length. The below reverses a string exactly 32 bytes long but doesn't take care of anything shorter. Is there an AVX / SSE way of doing this better in a…
kr1tzy
  • 138
  • 2
  • 14
0
votes
1 answer

How to use fused multiply and add in AVX for 16 bit packed integers

I know it is possible to do multiply-and-add using a single instruction in AVX2. I want to use a multiply-and-add instruction where each 256-bit AVX2 variable is packed with sixteen 16-bit variables. For instance, consider the example…
Rick
  • 361
  • 5
  • 17
0
votes
0 answers

Segmentation fault when returning a pointer to an array of __m256d

I was trying the Intel intrinsic AVX2 datatypes and functions. Unlike most code found on the web, which concentrates on looping over 256-bit segments of data in arrays, I tried to create an array of __m256d data. The code works for trying to load all…
Amirrad
  • 78
  • 6
0
votes
0 answers

AVX2 - method is 14x slower over a classic version

I have rewritten the logarithmic function from http://gruntthepeon.free.fr/ssemath/ to be used with doubles and AVX2. However, the entire function is 14x slower (15s) than the regular C/C++ version (1.1s). When I comment out all lines that use _mm256_sub_pd, AVX2…
Martin Perry
  • 9,232
  • 8
  • 46
  • 114
0
votes
1 answer

Running shell scripts from Vtune Amplifier

I am new to VTune and trying to profile an application. I want to call the executable using a shell script as there are many parameters and quite long too. How can I do it?
0
votes
1 answer

vector instructions ("vcl" and "ume") for counting sort

I'm trying vector instruction using libraries "vcl" and "ume" for a kind of counting sort, which gives only the position back // icpc sort.cpp -xCORE_AVX2 -o c #include #include #include #include…
mimi
  • 3
  • 4
0
votes
0 answers

.NET Optimizing operations on an array of mathematical vectors with SIMD

I developed a game where vectors are periodically added to each other. For instance: position += movement; is the movement which is done on every tick of the game for every unit in the playfield. A vector looks like this, ofc with additional…
Matthias
  • 948
  • 1
  • 6
  • 25
0
votes
1 answer

When I test the cycle number of the module, the results of each test are quite different.

When I test the cycle number of the module, the results of each test are quite different: 1781344 --> first test, 1264558 --> second test, 1388058 --> third test. I use __rdtsc() to record cycles, and use AVX512 intrinsics. Are there any methods to make the…
0
votes
0 answers

Optimized way to perform AVX2 VPXOR and popcount in minimum clock cycles

We have to perform a bitwise XOR operation on two arrays, each containing 5 elements of uint64_t (unsigned long long), and then count the 1 bits (popcount). What is the optimized way by using AVX2 256 bit wide YMM registers, AVX2 VPXOR and…
0
votes
0 answers

Going out of bounds in an AVX2 register

Say I have this piece of code: __m256i i1, i2, i3; memcpy(&i1, p + offsets[0], n); memcpy(&i2, p + offsets[1], n); memcpy(&i3, p + offsets[2], n); // etc And n is set greater than 32. I know bad things will happen to me - but I've not actually…
Owen Morgan
  • 494
  • 4
  • 6
0
votes
0 answers

Do masked FP multiplications improve performance in AVX512?

AVX512 has most floating point instructions available in masked form, where you can select which results will be changed/zeroed. Do the CPUs actually use this info to schedule which, say, multiplications should be performed, or does this merely…
Vojtěch Melda Meluzín
  • 1,117
  • 3
  • 11
  • 22
0
votes
1 answer

How to use _mm256_log_ps by leveraging Intel OpenCL SVML?

I found that _mm256_log_ps can't be used with GCC 7. The most common suggestion on Stack Overflow is to use ICC or to leverage the OpenCL SDK. After downloading the SDK and extracting the RPM file, there are three .so files: __ocl_svml_l9.so, __ocl_svml_e9.so,…
user2131907
  • 342
  • 1
  • 6
  • 14