Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new (VEX) encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating-point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX1 can only cross lanes in whole 128b chunks (e.g. VPERM2F128, VINSERTF128); AVX2 also introduced lane-crossing floating-point shuffles with per-element granularity (VPERMPS, VPERMPD).

Mixing AVX (VEX-encoded) and non-AVX (old SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance problem. Several performance questions under this tag turned out to have a missing VZEROUPPER as the answer.
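As a minimal sketch of where VZEROUPPER goes, the function below does its 256b work and then calls the `_mm256_zeroupper()` intrinsic before returning to code that might use legacy-SSE instructions. The name `sum_avx` is hypothetical, and `__attribute__((target("avx")))` assumes GCC or Clang. Compilers typically insert VZEROUPPER on their own when AVX code generation is enabled, so the explicit call is mainly illustrative.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: sums n floats using 256-bit AVX adds.
// target("avx") is a GCC/Clang extension (assumed here) that enables AVX
// for this one function without compiling the whole file with -mavx.
__attribute__((target("avx")))
float sum_avx(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);

    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    // Spill the 8 partial sums and finish in scalar code.
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (float v : lanes) total += v;
    for (; i < n; ++i) total += data[i];  // scalar tail

    // Clear the upper 128 bits of all ymm registers so that legacy-SSE
    // code executed after this function does not hit a transition penalty.
    _mm256_zeroupper();
    return total;
}
```

Callers should only reach `sum_avx` after confirming AVX is usable at run time, for example with `__builtin_cpu_supports("avx")` on GCC/Clang.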

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
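The in-lane behaviour is easiest to see by running a shuffle on known data. The sketch below (the function name `unpacklo_demo` is hypothetical; `target("avx")` assumes GCC or Clang) applies `_mm256_unpacklo_ps`, the intrinsic for VUNPCKLPS, to two 8-float inputs.

```cpp
#include <immintrin.h>
#include <cassert>

// Hypothetical demo: interleaves a and b with the 256-bit VUNPCKLPS.
// a, b and out must each point to at least 8 floats.
__attribute__((target("avx")))
void unpacklo_demo(const float* a, const float* b, float* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    // Each 128-bit lane is processed independently:
    //   low lane  -> a0 b0 a1 b1
    //   high lane -> a4 b4 a5 b5
    // Elements 2, 3, 6 and 7 of both inputs are not used at all.
    _mm256_storeu_ps(out, _mm256_unpacklo_ps(va, vb));
    _mm256_zeroupper();
}
```

With `a = {0,1,2,3,4,5,6,7}` and `b = {10,11,12,13,14,15,16,17}` the output is `{0,10,1,11,4,14,5,15}`, not the `{0,10,1,11,2,12,3,13}` one would get if the ymm register were treated as a single 8-element vector.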

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely the data is not aligned where alignment is required. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc., and the use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work. Includes in-lane vs. lane-crossing for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there's a way to detect that support in an OS-independent way.
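To illustrate the alignment FAQ entries above, here is a small sketch. The helper names `is_aligned_32` and `add_one` are hypothetical, and `target("avx")` assumes GCC or Clang. The aligned intrinsics `_mm256_load_ps` / `_mm256_store_ps` are used only when the pointer is 32-byte aligned; otherwise the code falls back to `loadu` / `storeu`, which accept any address.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when p may be used with the aligned
// 256-bit load/store intrinsics.
inline bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Hypothetical helper: adds 1.0f to the 8 floats starting at p.
__attribute__((target("avx")))
void add_one(float* p) {
    assert(p != nullptr);
    const __m256 one = _mm256_set1_ps(1.0f);
    if (is_aligned_32(p)) {
        // Aligned forms: these fault if p is not 32-byte aligned.
        _mm256_store_ps(p, _mm256_add_ps(_mm256_load_ps(p), one));
    } else {
        // Unaligned forms: valid for any address.
        _mm256_storeu_ps(p, _mm256_add_ps(_mm256_loadu_ps(p), one));
    }
    _mm256_zeroupper();
}
```

A buffer declared as `alignas(32) float buf[9];` takes the aligned path for `add_one(buf)` and the unaligned path for `add_one(buf + 1)`. In real code the branch is usually unnecessary: using `loadu` / `storeu` everywhere is a common choice, and the explicit check here only makes the requirement visible.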

1252 questions

2 answers

FMA3 in GCC: how to enable

I have an i5-4250U which has AVX2 and FMA3. I am testing some dense matrix multiplication code I wrote, using GCC 4.8.1 on Linux. Below is a list of three different ways I compile. SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp AVX:…

asked Jan 08 '14 at 16:37

Z boson

6 answers

How to use the Intel AVX in Java?

How do I use the Intel AVX vector instruction set from Java? It's a simple question but the answer seems to be hard to find.

java simd avx

asked Dec 27 '14 at 09:17

Albert Hendriks

5 answers

Transpose an 8x8 float using AVX/AVX2

Transposing an 8x8 matrix can be achieved by making four 4x4 matrices, and transposing each of them. This is not what I'm going for. In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However,…

simd avx avx2

asked Sep 02 '14 at 11:51

DavidS

5 answers

How to combine two m128 values to m256?

I would like to combine two __m128 values to one __m256. Something like this: __m128 a = _mm_set_ps(1, 2, 3, 4); __m128 b = _mm_set_ps(5, 6, 7, 8); to something like: __m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 }; are there any intrinsics that I can…

c x86 sse simd avx

asked Jun 20 '12 at 09:40

user1468756

2 answers

Choice between aligned vs. unaligned x86 SIMD instructions

There are generally two types of SIMD instructions: A. Ones that work with aligned memory addresses, that will raise general-protection (#GP) exception if the address is not aligned on the operand size boundary: movaps xmm0, xmmword ptr…

x86 sse simd avx avx512

asked Sep 03 '18 at 09:57

MikeF

3 answers

How to solve the 32-byte-alignment issue for AVX load/store operations?

I am having alignment issue while using ymm registers, with some snippets of code that seems fine to me. Here is a minimal working example: #include #include inline void ones(float *a) { __m256 out_aligned =…

c++ sse simd memory-alignment avx

asked Sep 16 '15 at 14:57

romeric

2 answers

Measuring memory bandwidth from the dot product of two arrays

The dot product of two arrays for(int i=0; i

c++ memory openmp bandwidth avx

asked Aug 07 '14 at 10:08

Z boson

5 answers

Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

Modern x86_64 linux with glibc will detect that CPU has support of AVX extension and will switch many string functions from generic implementation to AVX-optimized version (with help of ifunc dispatchers: 1, 2). This feature can be good for…

linux linker gdb glibc avx

asked Feb 25 '17 at 03:01

osgx

5 answers

SSE-copy, AVX-copy and std::copy performance

I tried to improve the performance of a copy operation via SSE and AVX: #include const int sz = 1024; float *mas = (float *)_mm_malloc(sz*sizeof(float), 16); float *tar = (float *)_mm_malloc(sz*sizeof(float), 16); …

c++ performance sse simd avx

asked Aug 19 '13 at 13:04

gorill

2 answers

How to sum __m256 horizontally?

I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use _mm_hadd_ps(xmm,xmm); _mm_hadd_ps(xmm,xmm); to get the result at the first component of the vector, but this does not scale with the 256…

sse vectorization intrinsics avx

asked Nov 04 '12 at 13:55

Yoav

4 answers

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the…

x86 simd avx vector-processing avx2

asked Mar 20 '12 at 21:48

Luigi Castelli

2 answers

Half-precision floating-point arithmetic on Intel chips

Is it possible to perform half-precision floating-point arithmetic on Intel chips? I know how to load/store/convert half-precision floating-point numbers [1] but I do not know how to add/multiply them without converting to single-precision…

x86 intel avx floating-point-conversion half-precision-float

asked Apr 24 '18 at 07:19

Kadir

1 answer

Is it worth bothering to align AVX-256 memory stores?

According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, section B.4 ("Performance Tuning Techniques for Intel® Microarchitecture Code Name Sandy Bridge"), subsection B.4.5.2 ("Assists"): 32-byte AVX store instructions that…

performance assembly x86-64 memory-alignment avx

asked Jun 16 '17 at 09:59

Maxim Masiutin

2 answers

How to rotate an SSE/AVX vector

I need to perform a rotate operation in as few clock cycles as possible. In the first case let's assume __m128i as source and dest type: source: || A0 || A1 || A2 || A3 || dest: || A1 || A2 || A3 || A0 || dest =…

c x86 sse intrinsics avx

asked Aug 10 '12 at 17:52

user1584773

2 answers

Reference manual/tutorial for x86 SIMD intrinsics?

I'm looking into using these to improve the performance of some code but good documentation seems hard to find for the functions defined in the *mmintrin.h headers, can anybody provide me with pointers to good info on these? EDIT: particularly…

simd sse intrinsics avx

asked Jul 28 '11 at 11:03

BD at Rivenhill
