Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.
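As a rough illustration of the AVX / AVX2 split for integer work, the C sketch here adds eight 32-bit integers, using a single 256b add only when the CPU reports AVX2 and falling back to two 128b SSE2 adds otherwise. The function names (`add8_i32` and its helpers) are made up for this example, and it assumes GCC or Clang on x86: `__attribute__((target))` and `__builtin_cpu_supports` are compiler extensions, not part of AVX itself.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* 256b integer add: VPADDD on ymm registers needs AVX2, not just AVX. */
__attribute__((target("avx2")))
static void add8_i32_avx2(int32_t *dst, const int32_t *a, const int32_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)dst, _mm256_add_epi32(va, vb));
}

/* Fallback: the same work as two 128b halves, which only needs SSE2. */
__attribute__((target("sse2")))
static void add8_i32_sse2(int32_t *dst, const int32_t *a, const int32_t *b)
{
    for (int i = 0; i < 8; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
}
#endif

/* Hypothetical dispatcher: picks a path at run time. */
static void add8_i32(int32_t *dst, const int32_t *a, const int32_t *b)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) { add8_i32_avx2(dst, a, b); return; }
    if (__builtin_cpu_supports("sse2")) { add8_i32_sse2(dst, a, b); return; }
#endif
    /* Plain scalar loop for anything else (including non-x86 builds). */
    for (int i = 0; i < 8; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling `add8_i32(dst, a, b)` gives the same result on every path; only the instructions used differ.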

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance penalty. Several performance questions under this tag turned out to have exactly this cause.
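With intrinsics, VZEROUPPER is exposed as `_mm256_zeroupper()`. A minimal sketch, assuming GCC or Clang on x86 (the helper names `sum8_avx` and `sum8` are hypothetical): a function that uses 256b registers clears their upper halves before returning to code that might use legacy SSE encodings. Compilers generally insert VZEROUPPER on their own when building with AVX enabled, so the explicit call mainly matters for hand-written asm or unusual build setups.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Sums 8 floats with 256b AVX loads, then issues VZEROUPPER. */
__attribute__((target("avx")))
static float sum8_avx(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);      /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);    /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);             /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));     /* 2 partial sums */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); /* total in element 0 */
    float r = _mm_cvtss_f32(s);
    _mm256_zeroupper(); /* upper ymm halves are clean from here on */
    return r;
}
#endif

/* Hypothetical wrapper: uses the AVX path only if the CPU supports it. */
static float sum8(const float *p)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx"))
        return sum8_avx(p);
#endif
    float r = 0.0f;
    for (size_t i = 0; i < 8; ++i)
        r += p[i];
    return r;
}
```

VZEROUPPER only zeroes bits 128 and above of each ymm register, so the scalar result computed beforehand is unaffected.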

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
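To make the in-lane behaviour concrete, here is a plain-C scalar model of what the 256b form of UNPCKLPS (`vunpcklps ymm`, or `_mm256_unpacklo_ps`) does. It is an illustrative sketch rather than real SIMD code, and the function name `unpcklps256_model` is invented for this example.

```c
#include <stddef.h>

/* Scalar model of 256b VUNPCKLPS: dst, a and b each hold 8 floats.
   The interleave happens independently inside each 128b lane
   (4 floats), using only the low two elements of that lane. */
static void unpcklps256_model(float dst[8], const float a[8], const float b[8])
{
    for (size_t lane = 0; lane < 2; ++lane) {
        size_t base = lane * 4;        /* first element of this lane */
        dst[base + 0] = a[base + 0];
        dst[base + 1] = b[base + 0];
        dst[base + 2] = a[base + 1];
        dst[base + 3] = b[base + 1];
    }
}
```

With `a = {0,1,2,3,4,5,6,7}` and `b = {10,11,12,13,14,15,16,17}` the model yields `{0,10,1,11, 4,14,5,15}`. Treating the register as one long vector would instead suggest `{0,10,1,11, 2,12,3,13}`, which is not what the instruction produces.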

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.


Interesting Q&As / FAQs:

1252 questions
0
votes
0 answers

Multiplication with AVX

This is my first time using AVX, and I'm trying to perform a simple multiplication on double-precision numbers, but not all of the results are correct. I get just the first 4 results and the others are garbage. #include…
FrancFine
  • 27
  • 3
0
votes
1 answer

Minimum of signed/unsigned integers using AVX

I was looking through the AVX instruction guide and though there are load, store and permute operations for 32-bit integer values, other operations such as determining minimum or maximum values, or shuffle operations are present only for floats and…
user1715122
  • 947
  • 1
  • 11
  • 26
0
votes
1 answer

mfpmath option to MinGW (or even gcc)

Do the -march=corei7-avx -mtune=corei7-avx or -march=corei7 -mtune=corei7 -mavx command line options to MinGW, together with the -mfpmath=sse command line option (or even with -mfpmath=both), enable the use of AVX instructions for math routines? Note, that…
Tomilov Anatoliy
  • 15,657
  • 10
  • 64
  • 169
-1
votes
0 answers

Usage of _mm_loadu_epi8 leads to error - ‘_mm_loadu_epi8’ was not declared in this scope

While trying to use the _mm_loadu_epi8 intrinsic, which is defined in the AVX512 family of Intel intrinsics, I got this error in C++: ‘_mm_loadu_epi8’ was not declared in this scope. Tried to use…
Srihari S
  • 17
  • 4
-1
votes
2 answers

Does anyone know of a fix for an MSVC compiler bug/annoyance where SIMD Extension settings get "stuck" on AVX?

Does anyone know of a fix for an MSVC compiler bug/annoyance where SIMD Extension settings get "stuck" on AVX? The context of this question is coding up SIMD CPU dispatchers, closely following Agner's well-known dispatch_example2.cpp project. I've…
dts
  • 125
  • 1
  • 10
-1
votes
1 answer

How to detect AVX2 support using gcc

I need to detect AVX2 support in my code and take decisions accordingly. I am aware of two methods - __builtin_cpu_supports("avx2") and #if defined(__AVX2__). Now the issue is that one returns true and the other false. The test code is as follows - int…
Atharva Dubey
  • 832
  • 1
  • 8
  • 25
-1
votes
1 answer

Count integers in an array where the set bits are a subset of a given mask

Given a mask and a value, the mask covers the value if all bits from the value fall into the mask. For example: mask: 0b011010 value: 0b010010 true or mask: 0b011010 value: 0b010110 false For int arr[arr_size], I need to calculate how many…
Zhihar
  • 1,306
  • 1
  • 22
  • 45
-1
votes
1 answer

Removing multiple _mm256_blend_ps decreases performance instead of increasing it

I am writing a small template library to transpose arbitrary matrices using AVX intrinsics. Since I am heavily using if constexpr and templates I wanted to make sure, that the compiler is applying all the optimization I expect and benchmarked my…
wychmaster
  • 712
  • 2
  • 8
  • 23
-1
votes
1 answer

How to improve Mersenne Twister for AVX/SSE?

Today I started a project with the goal of optimizing random number generation. I want to wipe several hard drives using the Mersenne Twister PRNG, but unfortunately I'm only able to produce around 200MB/s of random data, on 8 hard…
-1
votes
2 answers

How to create a 8 bit mask from lsb of __m64 value?

I have a use case where I have an array of bits, each represented as an 8-bit integer, for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only the lsb of each value. I know that using int _mm_movemask_pi8…
yadhu
  • 1,253
  • 14
  • 25
-1
votes
1 answer

print out the content of __m256i variable

I am trying to print out the value of an __m256i variable but I get a run-time error (file.exe has stopped working!). My CPU is Intel and supports AVX instructions. When I comment the cout line, the code runs. I am using Intel C++ compiler. what is…
Farhad
  • 29
  • 1
  • 6
-1
votes
1 answer

avx slower than sse multimedia extensions

I am writing a program that is well suited to parallelization with multimedia extensions. The program transforms an image, so I iterate over a matrix and modify each pixel in it. To iterate faster, I use multimedia extensions: At first I used…
-1
votes
1 answer

sse and avx performance on Sandybridge and IvyBridge

I am benchmarking a set of applications on a SandyBridge processor (i7-3820). The benchmark consists of two different versions. These two versions contain the same code with the only difference that the first version uses sse/sse2 intrinsics and…
-1
votes
1 answer

Align double vs align float for AVX operations

I want to multiply two (float/double) vectors with AVX operators. In order to do that, I need aligned memory. My function for float values is: #define SIZE 65536 float *g, *h, *j; g = (float*)aligned_alloc(32, sizeof(float)*SIZE); h =…
arc_lupus
  • 3,942
  • 5
  • 45
  • 81
-1
votes
2 answers

nvcc with avx support cannot find gcc builtin intrinsics

This is my first question ;-) I am trying to use AVX in a CUDA application (ccminer) but nvcc shows an error: /usr/local/cuda/bin/nvcc -Xcompiler "-Wall -mavx" -O3 -I . -Xptxas "-abi=no -v" -gencode=arch=compute_50,code=\"sm_50,compute_50\"…
Marcin Badtke
  • 599
  • 5
  • 9