Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new (VEX) encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width (256-bit) ymm vector registers and some new instructions for manipulating them. Floating-point vector instructions have 256-bit versions in AVX, but 256-bit integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles with finer than 128-bit granularity, such as VPERMPS and VPERMPD.
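As a rough illustration of the 256-bit floating-point operations described above, here is a minimal C++ sketch using intrinsics. The function name `add_floats_avx` is made up for this example, and the `target("avx")` attribute assumes GCC or Clang (it lets the file build without `-mavx`).

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper: out[i] = a[i] + b[i], eight floats per iteration.
// Assumption: GCC/Clang, where target("avx") enables AVX for this function only.
__attribute__((target("avx")))
void add_floats_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // unaligned 256-bit load
        __m256 vb = _mm256_loadu_ps(b + i);
        // Non-destructive 3-operand form: va and vb are both left intact.
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) {                      // scalar tail for leftover elements
        out[i] = a[i] + b[i];
    }
}
```

Call it only on a machine where AVX is usable, for example after checking `__builtin_cpu_supports("avx")` on GCC/Clang.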

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance problem. Several performance questions under this tag turned out to have exactly this as the answer.
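A minimal sketch of where VZEROUPPER fits, using the `_mm256_zeroupper()` intrinsic. The function name `sum_floats_avx` is invented for this example and `target("avx")` assumes GCC or Clang. Compilers typically insert VZEROUPPER on their own when a function is compiled with AVX enabled, so the explicit call mostly matters for hand-written assembly or code that hands control to legacy-SSE libraries.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper: sums n floats with 256-bit adds.
// Assumption: GCC/Clang, where target("avx") enables AVX for this function only.
__attribute__((target("avx")))
float sum_floats_avx(const float* data, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    }

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    // Done with ymm registers: zero their upper halves so that any
    // legacy-SSE code that runs next does not hit a transition penalty.
    _mm256_zeroupper();

    float total = 0.0f;
    for (int k = 0; k < 8; ++k) total += lanes[k];
    for (; i < n; ++i) total += data[i];      // scalar tail
    return total;
}
```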

Another pitfall for beginners is that most 256-bit instructions operate on two independent 128-bit lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
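To make the in-lane behaviour concrete, here is a small sketch around `_mm256_unpacklo_ps` (VUNPCKLPS). The wrapper name `unpacklo_demo` is made up, and `target("avx")` assumes GCC or Clang.

```cpp
#include <immintrin.h>

// Hypothetical wrapper: interleaves a and b with the 256-bit unpack-low.
// Assumption: GCC/Clang, where target("avx") enables AVX for this function only.
__attribute__((target("avx")))
void unpacklo_demo(const float* a, const float* b, float* out) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    // Works on each 128-bit lane separately:
    //   low lane  -> a0 b0 a1 b1
    //   high lane -> a4 b4 a5 b5
    // It does NOT produce a0 b0 a1 b1 a2 b2 a3 b3.
    _mm256_storeu_ps(out, _mm256_unpacklo_ps(va, vb));
}
```

With `a = {0,1,2,3,4,5,6,7}` and `b = {10,11,12,13,14,15,16,17}`, the result is `{0,10,1,11,4,14,5,15}`: elements 2 and 3 of each input never appear, because each lane only sees its own low half.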

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely the data is not aligned where alignment is required. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc, and use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work. Includes in-lane vs. lane-crossing shuffles for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there's a way to detect that support in an OS-independent way.
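Two of the points above, alignment and OS support, can be sketched together. This is only an illustration under stated assumptions: the names `avx_usable`, `Block8`, `scale_aligned`, `scale_unaligned` and `is_aligned32` are invented here, and `<cpuid.h>`, the inline `xgetbv` and `target("avx")` assume GCC or Clang on x86. The check follows the usual CPUID + XGETBV pattern: CPUID leaf 1 must report OSXSAVE and AVX, and XCR0 must show that the OS saves both XMM and YMM state.

```cpp
#include <immintrin.h>
#include <cpuid.h>     // GCC/Clang wrapper for the CPUID instruction
#include <cstdint>

// True if the CPU has AVX and the OS has enabled XMM + YMM state saving.
bool avx_usable() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    const unsigned osxsave = 1u << 27;        // CPUID.1:ECX bit 27
    const unsigned avx     = 1u << 28;        // CPUID.1:ECX bit 28
    if ((ecx & (osxsave | avx)) != (osxsave | avx)) return false;
    unsigned lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));  // read XCR0
    return (lo & 0x6u) == 0x6u;               // bit 1: SSE state, bit 2: AVX state
}

// alignas(32) makes every Block8 object 32-byte aligned.
struct alignas(32) Block8 { float v[8]; };

__attribute__((target("avx")))
void scale_aligned(Block8& blk, float factor) {
    // Aligned load/store: may fault if the address is not a multiple of 32.
    __m256 x = _mm256_load_ps(blk.v);
    _mm256_store_ps(blk.v, _mm256_mul_ps(x, _mm256_set1_ps(factor)));
}

__attribute__((target("avx")))
void scale_unaligned(float* p, float factor) {
    // loadu/storeu accept any address, e.g. an offset into a larger buffer.
    __m256 x = _mm256_loadu_ps(p);
    _mm256_storeu_ps(p, _mm256_mul_ps(x, _mm256_set1_ps(factor)));
}

bool is_aligned32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

A caller would check `avx_usable()` once at startup and fall back to a scalar or SSE path when it returns false.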

1252 questions


1 answer

GCC inline SSE code

Something bugs me regarding the vector extensions. The document: Intel® Advanced Vector Extensions Programming Reference States: VPSRLD ymm1, ymm2, imm8 So I went ahead and: __asm__ ( "vpsrld %ymm0, %ymm0, $0x4" ); GCC 4.8.2-19ubuntu1 spits…

asked Jun 03 '14 at 14:10

Anders Cedronius


1 answer

Using intrinsics to find next non-zero in an array

I have an int array[10000] and I want to iterate from a certain position to find the next non-zero index. Currently I use a basic while loop: while(array[i] == 0){ pos++; } etc I know with intrinsics I could test 4 integers for zero at a time,…

c++ performance vectorization sse avx

asked Apr 24 '14 at 16:41

user997112


1 answer

Check for zeros horizontally across __m128i vector?

I have several __m128i vectors containing 32-bit unsigned integers and I would like to check whether any of the 4 integers is a zero. I understand how I can "aggregate" the multiple __m128i vectors but eventually I will still end up with a single…

c++ intel vectorization sse avx

asked Apr 21 '14 at 21:34

user997112


0 answers

Implement Multiply and adding 2 matrix by avx programming

I want to implement multiplication and addition of 2 matrices in Visual C++ 2012 using AVX. I enabled AVX (Advanced Vector Extensions (/arch:AVX)) in Visual Studio. But for adding matrices, the time is the same whether I enable this property or disable it, and…

c++ visual-studio-2012 matrix avx

asked Dec 03 '13 at 10:52

user2855778


1 answer

How to specify the CFLAGS to gcc-4.6 or gcc-4.7 to use the Intel-AVX

I have an Intel Core i7-3770, and I found that it supports AVX. How do I specify the CFLAGS to gcc-4.6 or gcc-4.7 to use Intel AVX? Is there some example code or a manual about this? Thanks.

gcc intel avx

asked Oct 02 '13 at 05:53

mining


2 answers

Using AVX with GCC: __builtin_ia32_addpd256 not declared

If I #include I get this error: error: '__builtin_ia32_addpd256' was not declared in this scope I have defined __AVX__ and __FMA__ macros to make AVX available, but apparently this isn't enough. There is no error if I use compiler…

c++ gcc avx fma

asked Sep 18 '13 at 08:30

Violet Giraffe


2 answers

Avoiding unnecessary loads (SSE/AVX)

When compiled for x64, the following function uses the XMM0 register for parameter passing: void foo (double const scalar) { __m256d vector = _mm256_broadcast_sd(&scalar); } In assembly, the vbroadcastsd opcode can take a register operand. The…

c++ sse avx

asked Sep 10 '13 at 11:39

linguamachina


1 answer

C++ convert SSE code to AVX

With the help of YOU, I have used SSE in my code (sample below) with significant performance boost and I was wondering if this boost could be improved by using 256bit registers of AVX. int result[4] __attribute__((aligned(16))) = {0}; __m128i…

c++ sse cpu-registers avx

asked Sep 03 '13 at 08:48

Alexandros


1 answer

why do the SSE and AVX have same efficiency?

I use vs2012 and want to test the efficiency of SSE and AVX. The code for SSE and AVX is almost the same, except the SSE uses __m128 and AVX uses __m256. I expected the AVX code to be two times faster than the SSE code, but the test result shows…

c++ performance visual-studio-2012 sse avx

asked Aug 30 '13 at 10:19

myej


1 answer

32B chunks, contiguous and non-contiguous memory accesses

I wrote a matrix-matrix(32bit floats) multiplication function in C++ using intrinsics for large matrices(8192x8192), minimum data size is 32B for every read and write operation. I will change the algorithm into a blocking one such that it reads a…

c++ memory intrinsics avx contiguous

asked Jul 27 '13 at 20:28

huseyin tugrul buyukisik


1 answer

G++ Asm inline: register clobbering

Does the gcc compiler use push/pop for register backup if I don't write anything in the clobber list? What happens to registers in the input and output lists? I will write a short inline asm block that saves some general purpose registers to XMM/YMM registers then plays…

assembly g++ sse inline-assembly avx

asked Jul 07 '13 at 15:02

huseyin tugrul buyukisik


1 answer

AVX and Bubble Sort

I have to develop a bubble sort algorithm with AVX instructions with single precision numbers in input. Can anyone help me to look for the best implementation? I did a bubble sort version for SSE3: global sort32 sort32: start mov eax, [ebp+8] …

assembly x86 nasm avx sse3

asked Jul 01 '13 at 14:53

Frank


2 answers

FLT_EPSILON for a nth root finder with SSE/AVX

I'm trying to convert a function that finds the nth root in C for a double value from the following link http://rosettacode.org/wiki/Nth_root#C to find the nth root for 8 floats at once using AVX. Part of that code uses DBL_EPSILON * 10. However,…

c floating-point sse avx

asked Jun 14 '13 at 13:56

user2088790


0 answers

How to use Intel AVX on QNX Neutrino 6.5.0?

I recently started working with QNX 6.5.0 and can't understand how to develop programs using Intel AVX on QNX. I installed QNX Development Studio 6.5.0 with GCC 4.4.2 and I'm trying to write a simple program, but the build fails. #include int…

avx qnx qnx-neutrino

asked Jun 05 '13 at 09:26

Ildar


1 answer

Performing AVX integer operation

I'm trying to optimize some integer (_int64) operations using AVX. However, I can't even get a simple add operation to work. It keeps telling me illegal instruction. Can someone please point out what I'm doing wrong? Thanks for (int i = 0; i < 1; i+=4) { __m256i…

c++ avx

asked Apr 29 '13 at 17:02

FrancFine
