Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance penalty. Several performance questions under this tag turned out to have exactly this answer.
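As a minimal sketch of that pattern (not taken from any of the linked answers), the hypothetical function sum8 below uses 256-bit intrinsics and then calls _mm256_zeroupper() before returning to code that might run legacy-SSE instructions. It assumes a compiler that ships <immintrin.h> and defines __AVX__ when AVX code generation is enabled; otherwise it falls back to a scalar loop so the file still builds.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper: sums 8 floats starting at p. */
float sum8(const float *p)
{
#ifdef __AVX__
    __m256 v  = _mm256_loadu_ps(p);               /* unaligned 256-bit load */
    __m128 lo = _mm256_castps256_ps128(v);        /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);      /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);               /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));       /* 2 partial sums in elements 0 and 1 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));   /* total in element 0 */
    float r = _mm_cvtss_f32(s);
    /* Clear the upper halves of the ymm registers before any
       legacy-SSE code runs, to avoid the transition penalty. */
    _mm256_zeroupper();
    return r;
#else
    /* Scalar fallback when the file is not built with AVX enabled. */
    float r = 0.0f;
    for (size_t i = 0; i < 8; i++)
        r += p[i];
    return r;
#endif
}
```

Compilers normally insert VZEROUPPER themselves at the end of functions that use ymm registers in AVX-enabled builds, so the explicit call mostly matters for hand-written assembly or for code that calls into separately compiled legacy-SSE libraries.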

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
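To make the in-lane behaviour concrete, here is a small scalar model of the 256-bit form of UNPCKLPS (the function name model_unpacklo_ps_256 is made up for this illustration). It is a sketch of the documented semantics, not a replacement for the intrinsic; the real _mm256_unpacklo_ps call is shown next to it and is only compiled when AVX is enabled.

```c
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Scalar model of VUNPCKLPS ymm: each 128-bit lane (4 floats) is
   processed independently, interleaving the LOW two elements of
   that lane from a and b. Elements 2,3 and 6,7 of the inputs are
   never read. */
void model_unpacklo_ps_256(const float a[8], const float b[8], float out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        const float *la = a + 4 * lane;
        const float *lb = b + 4 * lane;
        float *lo = out + 4 * lane;
        lo[0] = la[0];
        lo[1] = lb[0];
        lo[2] = la[1];
        lo[3] = lb[1];
    }
}

#ifdef __AVX__
/* The same operation with the real intrinsic. */
void avx_unpacklo_ps_256(const float a[8], const float b[8], float out[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_unpacklo_ps(va, vb));
}
#endif
```

With a = {0,1,2,3,4,5,6,7} and b = {10,11,12,13,14,15,16,17}, the result is {0,10,1,11, 4,14,5,15}, not {0,10,1,11, 2,12,3,13} as one would expect if the register were treated as a single 8-element vector.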

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely the data is not aligned where alignment is required. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc, and use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work. It also covers in-lane vs. lane-crossing shuffles for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there's a way to detect that support in an OS-independent way.
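For the alignment questions above, a rough sketch of two of the options mentioned (alignas for static or stack storage, aligned_alloc for heap storage) might look like this. The helper names is_aligned and alloc_floats_32 are invented for the example, and it assumes a C11 library that provides aligned_alloc (MSVC would need _aligned_malloc / _aligned_free instead).

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define AVX_ALIGN 32   /* bytes required by aligned 256-bit loads/stores */

/* Returns nonzero if p is a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: heap buffer for n floats, 32-byte aligned.
   C11 aligned_alloc wants the size to be a multiple of the alignment,
   so round it up. Release the result with free(). */
float *alloc_floats_32(size_t n)
{
    size_t bytes   = n * sizeof(float);
    size_t rounded = (bytes + AVX_ALIGN - 1) / AVX_ALIGN * AVX_ALIGN;
    return aligned_alloc(AVX_ALIGN, rounded);
}

/* Static storage can simply be declared with alignas. */
alignas(AVX_ALIGN) static float static_buf[8];

float *get_static_buf(void)
{
    return static_buf;
}
```

Pointers obtained this way are safe to pass to _mm256_load_ps / _mm256_store_ps. For data whose alignment is not under your control, _mm256_loadu_ps / _mm256_storeu_ps accept any address.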

1252 questions

1 answer

Loading 128 bits of mixed float+int data?

I have a struct which has the following composition: static constexpr uint64_t emptyStructValue { 0 }; union MyStruct { explicit MyStruct(uint64_t comp) : composite(comp){} struct{ int16_t a; bool b; bool c; …

asked Jun 07 '15 at 13:00

user997112

1 answer

AVX - storing __256 vector back to the memory (void**) in C,

I have the following code extract written in C, double* res; posix_memalign((void **)&res, 32, sizeof(double)*4); __m256 ymm0, ymm1, ymm2, ymm3; ymm0 = _mm256_load_pd(vector_a); ymm1 = _mm256_load_pd(vector_b); ymm2 =…

c vector void-pointers memory-alignment avx

asked Jun 06 '15 at 08:22

lukieleetronic

0 answers

AVX _mm256_sin_ps missing on OSX i7 AVX2 Retina MacBook Pro

The Intel Intrinsics Guide lists _mm256_sin_ps as an available function with the header immintrin.h and the AVX flag, yet it seems to be missing from XCode / OSX. I do have an AVX2 machine and other AVX intrinsics work fine, am I missing something…

macos intel intrinsics avx

asked May 21 '15 at 00:25

bitwise

1 answer

SSE Sum of multiplication of 4 32-bit integers

Thanks to this post I found out how to multiply 4 32-bit integers. What I want to do now is sum up the results. How can I do this using intrinsics? I've got access to SSE, SSE2 and AVX. My initial thoughts were to unload res into an int array and…

c sse simd avx sse2

asked May 17 '15 at 15:47

Harrold

1 answer

Most efficient way to test a 256-bit YMM AVX register element for equal or less than zero

I'm implementing a particle system using Intel AVX intrinsics. When the Y-position of a particle is less than or equal to zero I want to reset the particle. The particle system is ordered in a SOA-pattern like this: class ParticleSystem { …

c++ x86 simd avx

asked May 11 '15 at 12:44

SvinSimpe

1 answer

AVX equivalent for _mm_storeu_ps?

I have quite a fast AVX code, but it's just one single function using AVX, the rest of the huge project is on SSE2, so I do NOT want to set architecture to AVX. At the end of each iteration I need to convert the 4 doubles in one YMM register to 4…

sse intrinsics avx

asked Mar 29 '15 at 17:20

mrzacek mrzacek

0 answers

AVX assembler loop gets slowed down 3x by vunpcklpd instruction

I'm fighting with optimizing this loop using AVX (excerpt only, NASM syntax): .repete: vmulpd ymm4, ymm1, ymm2 vhaddpd ymm5, ymm4, ymm4 vextractf128 xmm6, ymm5, 1 vaddsd xmm5, xmm5, xmm6 vcvtss2sd xmm7, [MSI + MCX * 4] vmulsd xmm3, xmm7,…

performance nasm avx

asked Mar 28 '15 at 23:55

mrzacek mrzacek

1 answer

prefetching pd (4 double) into __m256d register

I want to prefetch some data using AVX. I was checking the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) but there exists only the _mm_prefetch(...) for SSE. Does anyone know a workaround for AVX? Update…

avx prefetch

asked Feb 19 '15 at 18:17

LeoW.

2 answers

Why is Julia asking for AVX instructions on Ubuntu 14.04?

On my Ubuntu 14.04 box, Julia is complaining that my machine doesn't support AVX instructions. What may be the reason for this?

julia avx openblas

asked Feb 02 '15 at 16:05

Naren Yellavula

1 answer

Equal zero instruction in SSE

Suppose I have a 128-bit integer vector: __m128i x; Then how to know if all the bits in x are zeros? Checking every packed integer is a simple approach. But I'm looking for a faster way. Is there any instruction in SSE can do this job?

c++ sse avx

asked Jan 14 '15 at 12:09

KUN

1 answer

Segmentation Fault when using vmovupd

I am trying to input four floating point numbers at a time into the stack then transfer it into a ymm(avx) register. A friend of mine is working on the same project and our code seems identical but I'm getting a core dump when I call vmovupd ymm0,…

assembly 64-bit nasm avx

asked Sep 18 '14 at 21:14

wmurmann

0 answers

Subtract content of vector from scalar

I try to optimize my code for different SIMD architectures. What is the best way to calculate the following: For SSE: float s = something __m128 v = calculation result s -= v[0] + v[1] + v[2] + v[3] At the moment I calculate the horizontal sum…

c++ sse avx

asked Sep 06 '14 at 13:49

Maik

0 answers

VEXTRACTF128 versus VEXTRACTI128

As far as I can tell the VEXTRACTF128 and VEXTRACTI128 instructions do the same things, have the same latency, same throughput, and use the same ports. The only difference I cant tell between them is that VEXTRACTF128 only requires AVX VEXTRACTI128…

x86 intrinsics avx avx2

asked Sep 05 '14 at 11:03

Z boson

0 answers

process 8-bit int with AVX

Long story short, i've been trying to learn a new programming paradigm and get out of my comfort zone of just being someone who writes code to an individual that actually understands what's going on behind the scenes. I've read several assembly…

c intel sse simd avx

asked Aug 25 '14 at 00:36

Ray Renaldi

0 answers

Optimized build for AMD Piledriver arch-Unreal Engine 4

To begin with, in order to use Unreal Engine 4 you have to build it using Visual Studio 2013. In other words, you are able to optimize compiler settings in order to optimize overall performance. However, I am a bit of a newbie when it comes down to…

visual-studio-2013 compiler-optimization avx amd-processor

asked Jul 15 '14 at 22:10

foldingAthellas
