Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.
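
As a rough illustration of the 256-bit floating-point operations described above, the sketch below adds two float arrays eight elements at a time. The function name `add_f32_avx` is made up for this example, and the `target("avx")` attribute is a GCC/Clang feature assumed here so the function builds without `-mavx`; other compilers need their own AVX switch.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper: out[i] = a[i] + b[i] for n floats.
// GCC/Clang-specific attribute: enables AVX for this function only.
__attribute__((target("avx")))
void add_f32_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    // Main loop: one ymm register holds 8 single-precision floats.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // unaligned 256-bit load
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    // Scalar tail for the last n % 8 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The caller is responsible for checking that the CPU and OS support AVX before calling such a function.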

Mixing AVX (vex-encoded) and non-AVX (old SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance problem. This has led to several performance questions where this was the answer.
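
A minimal sketch of where VZEROUPPER fits, using the `_mm256_zeroupper()` intrinsic: the function below finishes its 256-bit work and clears the upper ymm halves before returning to code that may use the legacy SSE encoding. The name `sum8_avx` is hypothetical and `target("avx")` is a GCC/Clang assumption. Compilers typically insert VZEROUPPER themselves when AVX code generation is enabled, so the explicit call matters mostly for hand-written assembly or unusual build setups.

```cpp
#include <immintrin.h>

// Hypothetical helper: horizontal sum of 8 floats.
// GCC/Clang-specific attribute: enables AVX for this function only.
__attribute__((target("avx")))
float sum8_avx(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);     // elements 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);   // elements 4..7
    __m128 s  = _mm_add_ps(lo, hi);            // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));    // 2 partial sums in elements 0 and 1
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); // total in element 0
    float result = _mm_cvtss_f32(s);
    // Mark the upper 128 bits of all ymm registers as clean before
    // control returns to code that might run non-VEX SSE instructions.
    _mm256_zeroupper();
    return result;
}
```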

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
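
The in-lane behaviour is easiest to see by running a shuffle and printing the result. The sketch below (hypothetical function name, GCC/Clang `target("avx")` attribute assumed) applies the 256-bit form of UNPCKLPS through `_mm256_unpacklo_ps`. With a = {0,1,2,3,4,5,6,7} and b = {10,11,12,13,14,15,16,17} the output is {0,10,1,11, 4,14,5,15}: each 128-bit lane interleaves its own two low elements, and elements 2 and 3 never reach the result.

```cpp
#include <immintrin.h>

// Hypothetical helper: out = unpacklo(a, b) on 8 floats.
// GCC/Clang-specific attribute: enables AVX for this function only.
__attribute__((target("avx")))
void unpacklo_demo(const float* a, const float* b, float* out) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    // Per 128-bit lane: {a0, b0, a1, b1}, using that lane's own elements.
    // It is NOT {a0, b0, a1, b1, a2, b2, a3, b3} across the whole register.
    __m256 r = _mm256_unpacklo_ps(va, vb);
    _mm256_storeu_ps(out, r);
}
```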

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely the data is not aligned where alignment is required. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc., and the use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work. Includes in-lane vs. lane-crossing for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there is a way to detect that support in an OS-independent way.
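
The alignment rules from the first two entries can be sketched as follows. This is only an illustration: `Vec8`, `scale_aligned`, `scale_unaligned` and `is_aligned32` are names invented here, and `target("avx")` is a GCC/Clang assumption. The aligned intrinsics `_mm256_load_ps` / `_mm256_store_ps` may fault on an address that is not a multiple of 32, while the `loadu` / `storeu` forms accept any address.

```cpp
#include <immintrin.h>
#include <cstdint>

// alignas(32) makes every Vec8 object start on a 32-byte boundary.
struct alignas(32) Vec8 {
    float v[8];
};

// Returns true when p is a multiple of 32.
bool is_aligned32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Aligned path: only valid because Vec8 guarantees 32-byte alignment.
__attribute__((target("avx")))
void scale_aligned(Vec8& x, float k) {
    __m256 r = _mm256_mul_ps(_mm256_load_ps(x.v), _mm256_set1_ps(k));
    _mm256_store_ps(x.v, r);
}

// Unaligned path: works for any float pointer with 8 readable elements.
__attribute__((target("avx")))
void scale_unaligned(float* p, float k) {
    __m256 r = _mm256_mul_ps(_mm256_loadu_ps(p), _mm256_set1_ps(k));
    _mm256_storeu_ps(p, r);
}
```

When the alignment of a buffer is not under your control, the unaligned form is the safer default.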

1252 questions

votes

0 answers

Multiplication with AVX

Please this is my first time of using AVX and I'm trying to perform a simple multiplication on double precision numbers but I'm not getting all results correct. I get just the first 4 results and the others are jargon. #include…

c++ avx

asked Mar 15 '13 at 05:41

FrancFine

votes

1 answer

Minimum of signed/unsigned integers using AVX

I was looking through the AVX instruction guide and though there are load, store and permute operations for 32-bit integer values, other operations such as determining minimum or maximum values, or shuffle operations are present only for floats and…

c sse avx

asked Dec 13 '12 at 22:24

user1715122

votes

1 answer

mfpmath option to MinGW (or even gcc)

Does the -march=corei7-avx -mtune=corei7-avx or -march=corei7 -mtune=corei7 -mavx command line options to MinGW with the -mfpmath=sse command line option (or even with -mfpmath=both) enables using of AVX instruction for math routines? Note, that…

gcc mingw sse avx

asked Dec 06 '12 at 05:11

Tomilov Anatoliy

-1

votes

0 answers

Usage of _mm_loadu_epi8 leads to error - ‘_mm_loadu_epi8’ was not declared in this scope

While trying to load _mm_loadu_epi8 instruction which is defined in AVX512 family of Intel Intrinsics instruction was getting error in c++ that - Usage of _mm_loadu_epi8 leads to error - ‘_mm_loadu_epi8’ was not declared in this scope. Tried to use…

c++ intrinsics avx avx512

asked Aug 14 '23 at 10:51

Srihari S

-1

votes

2 answers

Does anyone know of a fix for an MSVC compiler bug/annoyance where SIMD Extension settings get "stuck" on AVX?

Does anyone know of a fix for an MSVC compiler bug/annoyance where SIMD Extension settings get "stuck" on AVX? The context of this question is coding up SIMD CPU dispatchers, closely following Agner's well-known dispatch_example2.cpp project. I've…

c++ visual-c++ simd avx vector-class-library

asked Jan 06 '22 at 19:25

dts

-1

votes

1 answer

How to detect AVX2 support using gcc

I need to detect AVX2 support in my code take decisions accordingly. I am aware of two methods - __builtin_cpu_supports("avx2") and #if defined(__AVX2__). Now the issue is one returns true and another false. The test code is as follows - int…

gcc g++ avx instruction-set avx2

asked May 16 '21 at 12:32

Atharva Dubey

-1

votes

1 answer

Count integers in an array where the set bits are a subset of a given mask

Given a mask and a value, the mask covers the value if all bits from the value fall into the mask. For example: mask: 0b011010 value: 0b010010 true or mask: 0b011010 value: 0b010110 false For int arr[arr_size], I need to calculate how many…

c++ optimization sse avx bitmask

asked Jan 31 '21 at 15:35

Zhihar

-1

votes

1 answer

Removing multiple _mm256_blend_ps decreases performance instead of increasing it

I am writing a small template library to transpose arbitrary matrices using AVX intrinsics. Since I am heavily using if constexpr and templates I wanted to make sure, that the compiler is applying all the optimization I expect and benchmarked my…

c++ performance simd avx

asked Mar 07 '20 at 23:43

wychmaster

-1

votes

1 answer

How to improve Mersenne Twister for AVX/SSE?

Today i have started a project having the goal to optimize the generation of random numbers. I want to wipe several hard drives, using the Mersenne Twister PRNG, but unfortunately i'm only able to produce around 200MB/s of random data, on 8 hard…

c optimization random vectorization avx

asked Feb 06 '20 at 12:46

Fabian Druschke

-1

votes

2 answers

How to create a 8 bit mask from lsb of __m64 value?

I have a use case, where I have array of bits each bit is represented as 8 bit integer for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only lsb of each value. I know that using int _mm_movemask_pi8…

c++ simd avx avx2 mmx

asked Aug 30 '18 at 11:40

yadhu

-1

votes

1 answer

print out the content of __m256i variable

I am trying to print out the value of an __m256i variable but I get a run-time error (file.exe has stopped working!). My CPU is Intel and supports AVX instructions. When I comment the cout line, the code runs. I am using Intel C++ compiler. what is…

c++ windows x86 simd avx

asked Mar 07 '18 at 13:38

Farhad

-1

votes

1 answer

avx slower than sse multimedia extensions

I am programming a perfect program to parallelize with multimedia extensions. The program consists of transforming an image, so i go over a matrix and i modify each pixel inside it. For go over faster, i use multimedia extensions: At first i used…

sse cpu-architecture hpc avx avx512

asked Nov 17 '16 at 12:22

Paco Muñoz Martinez

-1

votes

1 answer

sse and avx performance on Sandybridge and IvyBridge

I am benchmarking a set of applications on a SandyBridge processor (i7-3820). The benchmark consists of two different versions. These two versions contain the same code with the only difference that the first version uses sse/sse2 instrinsics and…

visual-studio-2015 sse simd avx

asked Jun 27 '16 at 16:38

Giannis

-1

votes

1 answer

Align double vs align float for AVX operations

I want to multiply two (float/double) vectors with AVX operators. In order to do that, I need aligned memory. My function for float values is: #define SIZE 65536 float *g, *h, *j; g = (float*)aligned_alloc(32, sizeof(float)*SIZE); h =…

c++ memory-alignment avx

asked May 25 '16 at 09:19

arc_lupus

-1

votes

2 answers

nvcc with avx support cannot find gcc builtin intrinsics

This is my first question ;-) I try to use AVX in CUDA application (ccminer) but nvcc shows an error: /usr/local/cuda/bin/nvcc -Xcompiler "-Wall -mavx" -O3 -I . -Xptxas "-abi=no -v" -gencode=arch=compute_50,code=\"sm_50,compute_50\"…

c linux cuda nvcc avx

asked Oct 10 '14 at 14:18

Marcin Badtke
