Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding (VEX) for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating-point vector instructions have 256-bit versions in AVX, but 256-bit integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles with element granularity (vpermps, vpermpd).

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance problem. Several performance questions under this tag turned out to have exactly this answer.
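As a minimal sketch of where VZEROUPPER goes, the C function below does its 256-bit work and then calls the _mm256_zeroupper() intrinsic before returning to code that might use legacy SSE encodings. The names scale_avx and scale are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/clang extensions. Compilers normally insert VZEROUPPER on their own when AVX is enabled, so the explicit call mostly matters for hand-written assembly or when that automatic insertion is turned off.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: multiply n floats by k using 256-bit AVX.
   The target attribute (GCC/clang) enables AVX for this function only. */
__attribute__((target("avx")))
static void scale_avx(float *p, size_t n, float k)
{
    const __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vk));
    for (; i < n; ++i)          /* scalar tail for the last n % 8 elements */
        p[i] *= k;
    /* Zero the upper halves of all ymm registers before control returns
       to code that may use legacy (non-VEX) SSE encodings. */
    _mm256_zeroupper();
}

/* Hypothetical dispatcher: take the AVX path only when it is usable. */
void scale(float *p, size_t n, float k)
{
    if (__builtin_cpu_supports("avx")) {
        scale_avx(p, n, k);
    } else {
        for (size_t i = 0; i < n; ++i)
            p[i] *= k;
    }
}
```

Calling scale(buf, 10, 2.0f) doubles every element of buf on either path.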

Another pitfall for beginners is that most 256-bit instructions operate on two independent 128-bit lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using VUNPCKLPS and other shuffle / horizontal instructions.
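To make the in-lane behaviour concrete, here is a small sketch with a plain-C model of the 256-bit unpack-low operation next to the real _mm256_unpacklo_ps intrinsic. The function names are hypothetical, and the target attribute is a GCC/clang extension used so the file does not need -mavx.

```c
#include <immintrin.h>

/* Scalar model of 256-bit VUNPCKLPS: each 128-bit lane (4 floats) is
   handled independently, interleaving the low two elements of a and b. */
void unpacklo_ps_model(const float a[8], const float b[8], float out[8])
{
    for (int lane = 0; lane < 2; ++lane) {
        const int base = 4 * lane;
        out[base + 0] = a[base + 0];
        out[base + 1] = b[base + 0];
        out[base + 2] = a[base + 1];
        out[base + 3] = b[base + 1];
    }
}

/* The same operation using the intrinsic; only call this on a CPU and OS
   that support AVX. */
__attribute__((target("avx")))
void unpacklo_ps_avx(const float a[8], const float b[8], float out[8])
{
    const __m256 va = _mm256_loadu_ps(a);
    const __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_unpacklo_ps(va, vb));
}
```

With a = {a0..a7} and b = {b0..b7}, both functions produce {a0, b0, a1, b1, a4, b4, a5, b5}. Someone picturing one long vector would expect {a0, b0, a1, b1, a2, b2, a3, b3} instead.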

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely you don't align the data when needed. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc, and use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work. Includes in-lane vs. lane-crossing for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there's a way to detect its support in an OS-independent way.
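The alignment and detection points in the Q&As above can be sketched together in C. This is an illustration under stated assumptions, not a reference implementation: alloc_floats_32, sum_avx, sum_scalar and sum_floats are hypothetical names, aligned_alloc is the C11 function, and __builtin_cpu_supports plus the target attribute are GCC/clang extensions. On those compilers the "avx" check is expected to cover OS support for the ymm state as well as the CPU feature bit.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n floats on a 32-byte boundary.
   C11 aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the result with free(). */
static float *alloc_floats_32(size_t n)
{
    size_t bytes = (n * sizeof(float) + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes ? bytes : 32);
}

/* AVX path: aligned loads only when the pointer really is 32-byte aligned. */
__attribute__((target("avx")))
static float sum_avx(const float *p, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    const int aligned = ((uintptr_t)p % 32) == 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* _mm256_load_ps may fault on a misaligned address;
           _mm256_loadu_ps accepts any address. */
        __m256 v = aligned ? _mm256_load_ps(p + i) : _mm256_loadu_ps(p + i);
        acc = _mm256_add_ps(acc, v);
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
    for (; i < n; ++i)          /* scalar tail */
        total += p[i];
    return total;
}

static float sum_scalar(const float *p, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += p[i];
    return total;
}

/* Hypothetical dispatcher: use AVX only when the runtime check passes. */
float sum_floats(const float *p, size_t n)
{
    return __builtin_cpu_supports("avx") ? sum_avx(p, n) : sum_scalar(p, n);
}
```

A buffer from alloc_floats_32 takes the aligned-load path, while a pointer offset by one float (for example buf + 1) takes the unaligned path. Both return the same sum.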

1252 questions

votes

0 answers

clang in Xcode 7.2 generates vxorps

I encountered an issue where compiling cryptopp with clang from Xcode 7.2 generates a vxorps instruction in ByteQueue::ByteQueue(unsigned long). Since our product can be run on old CPUs where this instruction triggers an illegal-instruction fault, I need to…

clang avx xcode7.2

asked Mar 13 '18 at 15:16

Rudolfs Bundulis

votes

1 answer

How to extract an array of properties out of an array of objects?

Imagine that I have an array of objects, like this: class Segment { public: float x1, x2, y1, y2; }; Segment* SegmentList[100]; Based on this array of Segments, I want to quickly extract its properties and create vectors with all the x1, x2, y1…

c++ arrays avx

asked Mar 05 '18 at 21:53

Alkin

votes

0 answers

Anaconda Tensorflow Compiler Issue CPU AVX AVX2

I installed Tensorflow via Anaconda, I tried testing if it works using the short program on the website, but I ended up with this error. Is there something wrong, or is it my CPU can't handle it? Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018,…

tensorflow cpu conda avx avx2

asked Feb 22 '18 at 23:49

Jamal Abdi

votes

0 answers

Is there a penalty for mixing x86-64 integer instructions with AVX1/2/512 instructions?

I have seen a lot of assembly with AVX (all three flavors), and in all the cases that I have seen, the more concentrated one kind of instruction is, the better the code performs. But, for example, things like doing a load into a 32-bit register and then…

performance x86 avx avx2 avx512

asked Feb 19 '18 at 07:52

JLV

votes

0 answers

How to extract data from xmm register by index stored in another register r10 to store that dword in eax?

I need to extract the dword of XMM1 at the index stored in R10 and move it to the EAX register. How can I do that efficiently, without a memory access? The following does not compile: PEXTRD EAX,XMM1,R10d

performance assembly x86 sse avx

asked Jan 22 '18 at 20:25

xakepp35

votes

1 answer

SSE/AVX - VMULPD produces all zeros for small integer inputs?

I'm using X64dbg to test SSE/AVX assembly instructions to better understand their behavior before using them to write code. I've been able to test the vmovapd, vbroadcastsd, vsubpd, and vaddpd instructions this way without issue. I loaded YMM…

assembly floating-point x86 avx

asked Nov 15 '17 at 00:14

Gogeta70

votes

0 answers

C AVX2 sum array horizontaly

I have some problems with AVX2 instructions. I wrote a program in C which reads a binary file of unsigned chars and then sums them. Now I want to replace the C for loop with AVX2 instructions, but it doesn't work. That's the first time I want to use AVX2.…

c arrays avx avx2

asked Nov 07 '17 at 16:16

AsdFork

votes

2 answers

Why does this AVX intrinsic cause "Segmentation fault" with clang, but not GCC?

It seems the two functions below can cause a segmentation fault when compiled with clang using -mavx (or -march=sandybridge -> skylake). void _mm256_mul_double_intrin(double* a, double* b, int N) { int nb_iters = N / ( sizeof(__m256d) / sizeof(double)…

c++ clang inline-assembly intrinsics avx

asked Oct 30 '17 at 08:14

sandthorn

votes

1 answer

Linker error GCC7 with -mavx flag

Compiling code that uses the 256-bit vector datatype (__m256d) from Intel's AVX extension with gcc7 or clang fails. I am able to compile and use 128-bit vectors (without the -mavx flag). But as soon as I try the AVX vectors, either some assembler command definitions are…

macos clang linker-errors avx gcc7

asked Oct 16 '17 at 16:18

Marie Hoffmann

votes

1 answer

Conda install dlib AVX support

I've just installed dlib using conda from the conda-forge channel. Is it possible to know whether it has been built with AVX support?

conda avx dlib

asked Oct 13 '17 at 08:59

se7entyse7en

votes

0 answers

Which x86 ISA extensions imply support for previous SIMD extensions?

My CPU supports the following technologies: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and AVX. When I write my code and check for hardware support, can I assume things like "If the processor supports AVX, it ALWAYS supports all of these other…

x86 sse simd avx cpuid

asked Oct 06 '17 at 01:38

HesNotTheStig

votes

0 answers

AVX Command Error for integer addition

Does anyone know how to resolve these types of errors? I am trying to add two 256-bit integer vectors, but I am getting the following error: cpu_avx.c:12:20: error: incompatible types when initializing type ‘__m256i’ using type ‘int’ __m256i result =…

cpu simd avx avx2

asked Oct 01 '17 at 12:53

Sagar

votes

0 answers

Why this AVX code so slow?

Here is the code; the question is why the AVX version is slower than the naive variant. const double __declspec(align(16)) mx[4] = { 1., 1., 1., -100.}; const double __declspec(align(16)) an[8] = { 8., 7., 6., 5., 4., 3., 2., 1.}; __forceinline…

c++ performance sse simd avx

asked Sep 29 '17 at 14:02

Des Spigel

votes

1 answer

Inside virtualenv: How to get tensorflow to support sse 4.2 and avx

Just to say it upfront, I'm aware of all the answers that require bazel and they didn't work for me. I'm using virtualenv as the tensorflow website recommends to. (tensorflow27)name@computersname:~$ bazel build --linkopt='-lrt' -c opt --copt=-mavx…

tensorflow compilation virtualenv sse avx

asked Sep 02 '17 at 11:00

evolution

votes

1 answer

SIMD -> uint16_t array to float array work on float then back to uint16_t

I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions. The code using for loops is int idx; uint16_t* A, B, C; float…

c++ linux simd avx avx2

asked Sep 01 '17 at 14:23

user1273813
