Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (where AVX only provided 256-bit floating point).

AVX2 adds support for 256-bit integer SIMD. Most existing 128-bit SSE integer instructions are extended to 256-bit. AVX2 uses the same VEX encoding scheme as AVX instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER and non-obvious data movement in shuffles, because most 256-bit shuffles operate separately within each 128-bit lane.
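As a rough illustration of the lane behaviour, the C sketch below reverses 32 bytes. `_mm256_shuffle_epi8` (VPSHUFB) only moves bytes within each 128-bit lane, so a second step with `_mm256_permute2x128_si256` is needed to swap the lanes. The function names `reverse32`, `reverse32_avx2` and `reverse32_scalar` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang-specific assumptions used so the file builds without `-mavx2` and falls back to scalar code on CPUs without AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: reverse 32 bytes with AVX2.
   in and out must not overlap. */
__attribute__((target("avx2")))
static void reverse32_avx2(const uint8_t *in, uint8_t *out)
{
    /* Same byte-reversal pattern for both lanes: the control indices of
       VPSHUFB only address bytes inside their own 128-bit lane. */
    const __m256i rev = _mm256_setr_epi8(
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i r = _mm256_shuffle_epi8(v, rev);      /* reverse within each lane */
    r = _mm256_permute2x128_si256(r, r, 0x01);    /* swap low and high lanes */
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Scalar fallback for CPUs without AVX2. */
static void reverse32_scalar(const uint8_t *in, uint8_t *out)
{
    for (int i = 0; i < 32; i++)
        out[i] = in[31 - i];
}

/* Runtime dispatch (GCC/Clang builtin, an assumption of this sketch). */
void reverse32(const uint8_t *in, uint8_t *out)
{
    if (__builtin_cpu_supports("avx2"))
        reverse32_avx2(in, out);
    else
        reverse32_scalar(in, out);
}
```

Without the lane swap, the result would be each 16-byte half reversed in place rather than the whole 32-byte buffer reversed.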

AVX2 also adds the following new functionality:

Scalar -> Vector register broadcast
Gather loads for loading a vector from different memory locations.
Masked memory loads/stores
New permute instructions
Element-wise bit-shifting that allows each element of a vector to be shifted by a different amount.
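The per-element shift from the list above can be sketched in C as follows. This is a minimal example, assuming 32-bit elements and the `_mm256_sllv_epi32` intrinsic (VPSLLVD); the function names `shift_left_each`, `shift_left_each_avx2` and `shift_left_each_scalar` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang-specific assumptions that let the code build without `-mavx2` and fall back to scalar code where AVX2 is unavailable.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: shift each of 8 uint32_t elements left by its own count. */
__attribute__((target("avx2")))
static void shift_left_each_avx2(const uint32_t *in, const uint32_t *counts,
                                 uint32_t *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i c = _mm256_loadu_si256((const __m256i *)counts);
    /* VPSLLVD: element i is shifted by counts[i]; counts above 31 give 0. */
    __m256i r = _mm256_sllv_epi32(v, c);
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Scalar fallback with the same semantics for counts above 31. */
static void shift_left_each_scalar(const uint32_t *in, const uint32_t *counts,
                                   uint32_t *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = counts[i] < 32 ? in[i] << counts[i] : 0;
}

/* Runtime dispatch (GCC/Clang builtin, an assumption of this sketch). */
void shift_left_each(const uint32_t *in, const uint32_t *counts, uint32_t *out)
{
    if (__builtin_cpu_supports("avx2"))
        shift_left_each_avx2(in, counts, out);
    else
        shift_left_each_scalar(in, counts, out);
}
```

Before AVX2, the SSE/AVX shift instructions applied a single count to every element, so a per-element shift needed a multiply or several blended shifts.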

The AVX2 instruction set was introduced together with FMA3 (3-operand Fused Multiply-Add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but did not gain AVX2 support until the later Excavator microarchitecture.)

683 questions

1 answer

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

I'm trying to pack 16-bit data to 8 bits using _mm256_shuffle_epi8, but the result I get is not what I'm expecting. auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…

asked Sep 12 '19 at 02:22

Sly14

1 answer

AVX2 appears not to be defined in eclipse-cdt

My compiler supports AVX2, and I added -mavx2 to the C++ flags, but the __AVX2__ macro is not defined in my code. #ifdef __AVX2__ #include #endif appears to be disabled in the code. Edit: My compiler version is: g++ (Ubuntu…

gcc g++ eclipse-cdt intrinsics avx2

asked Sep 05 '19 at 12:33

Jacko

1 answer

Setting proper alignment of packed long long on GCC to use with avx2 instructions

Introduction: I'm writing a function to process 4 packed long long int in x86_64 assembly using AVX2 instructions. Here is what my header file looks like: avx2.h #define AVX2_ALIGNMENT 32 // Processes 4 packed long long int and // returns a pointer…

c language-lawyer x86-64 memory-alignment avx2

asked Aug 10 '19 at 10:34

St.Antario

1 answer

String reverse with x64 SSE / AVX registers

I'm trying to write SIMD assembly instructions to reverse a string between 16 and 32 bytes of length. The below reverses a string exactly 32 bytes long but doesn't take care of anything shorter. Is there an AVX / SSE way of doing this better in a…

assembly x86-64 sse avx2

asked Aug 08 '19 at 00:52

kr1tzy

1 answer

How to use fused multiply and add in AVX for 16 bit packed integers

I know it is possible to do multiply-and-add using a single instruction in AVX2. I want to use a multiply-and-add instruction where each 256-bit AVX2 variable is packed with 16 16-bit variables. For instance, consider the example…

c performance intel avx2 fma

asked Jul 31 '19 at 09:11

Rick

0 answers

Segmentation fault when returning a pointer to an array of __m256d

I was trying the Intel intrinsic AVX2 data types and functions. Unlike most code found on the web, which concentrates on looping over 256-bit segments of data in arrays, I tried to create an array of __m256d data. The code works for trying to load all…

c++ segmentation-fault avx2

asked Jul 16 '19 at 13:17

Amirrad

0 answers

AVX2 - method is 14x slower over a classic version

I have rewritten the logarithmic function from http://gruntthepeon.free.fr/ssemath/ to be used with doubles and AVX2. However, the entire function is 14x slower (15s) than the regular C/C++ version (1.1s). When I comment out all lines that use _mm256_sub_pd, AVX2…

c++ performance avx2

asked Jun 29 '19 at 10:47

Martin Perry

1 answer

Running shell scripts from Vtune Amplifier

I am new to VTune and trying to profile an application. I want to call the executable using a shell script as there are many parameters and quite long too. How can I do it?

optimization avx2 intel-vtune

asked Jun 19 '19 at 14:32

prajjwal_jha

1 answer

vector instructions ("vcl" and "ume") for counting sort

I'm trying vector instruction using libraries "vcl" and "ume" for a kind of counting sort, which gives only the position back // icpc sort.cpp -xCORE_AVX2 -o c #include #include #include #include…

c++ histogram simd avx2 counting-sort

asked Jun 08 '19 at 04:52

mimi

0 answers

.NET Optimizing operations on an array of mathematical vectors with SIMD

I developed a game where vectors are periodically added to each other. For instance: position += movement; is the movement which is done on every tick of the game for every unit in the playfield. A vector looks like this, ofc with additional…

c# simd avx2 .net-core-3.0

asked Jun 04 '19 at 10:21

Matthias

1 answer

When I test the cycle number of the module, the results of each test are quite different.

When I test the cycle number of the module, the results of each test are quite different? 1781344-->First test 1264558-->Second test 1388058-->Third test I use __rdtsc() to record cycles, and use AVX512 intrinsic. Are there any methods to make the…

benchmarking intel avx2 avx512 rdtsc

asked Apr 25 '19 at 03:18

yueluojieying

0 answers

Optimized way to perform AVX2 VPXOR and popcount in minimum clock cycles

We have to perform a bitwise XOR operation on two arrays each containing 5 elements of uint64_t (unsigned long long) and then count the 1 bits (pop count). What is the optimized way by using AVX2 256 bit wide YMM registers, AVX2 VPXOR and…

c++ x86 simd avx2 hammingweight

asked Mar 27 '19 at 10:08

Muhammad Junaid

0 answers

Going out of bounds in an AVX2 register

Say I have this piece of code: __m256i i1, i2, i3; memcpy(&i1, p + offsets[0], n); memcpy(&i2, p + offsets[1], n); memcpy(&i3, p + offsets[2], n); // etc And n is set greater than 32. I know bad things will happen to me - but I've not actually…

c++ undefined-behavior simd avx2

asked Jan 08 '19 at 18:51

Owen Morgan

0 answers

Do masked FP multiplications improve performance in AVX512?

AVX512 has several/most floating point instructions available in masked form, where you can select which results will be changed/zeroed. Do the CPUs actually use this info to schedule which, say, multiplications should be performed, or does this merely…

vectorization avx2 avx512

asked Oct 18 '18 at 07:59

Vojtěch Melda Meluzín

1 answer

How to use _mm256_log_ps by leveraging Intel OpenCL SVML?

I found that _mm256_log_ps can't be used with GCC7. The most common suggestion on Stack Overflow is to use ICC or to leverage the OpenCL SDK. After downloading the SDK and extracting the RPM file, there are three .so files: __ocl_svml_l9.so, __ocl_svml_e9.so,…

gcc opencl avx avx2

asked Aug 11 '18 at 05:00

user2131907
