Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance penalty. A missing VZEROUPPER has turned out to be the answer to several performance questions.
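
As a rough illustration of where VZEROUPPER fits, here is a minimal C sketch using the `_mm256_zeroupper()` intrinsic. The function names `sum_floats` and `sum_floats_avx` are hypothetical, and the GCC/clang `target("avx")` attribute and `__builtin_cpu_supports` are assumed to be available. Compilers usually insert VZEROUPPER on their own when a function has used ymm registers, so the explicit call mainly shows the pattern that hand-written assembly would have to follow.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1

/* Hypothetical helper: sums n floats using 256-bit AVX adds.
   target("avx") lets this one function use AVX without building
   the whole project with -mavx (GCC/clang extension). */
__attribute__((target("avx")))
static float sum_floats_avx(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    /* Zero the upper halves of all ymm registers before control can
       reach code that uses the legacy SSE encoding. */
    _mm256_zeroupper();

    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];
    for (; i < n; i++)          /* scalar tail for leftover elements */
        total += x[i];
    return total;
}
#endif

/* Dispatches to the AVX version only when the CPU and OS support it. */
float sum_floats(const float *x, size_t n)
{
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx"))
        return sum_floats_avx(x, n);
#endif
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += x[i];
    return total;
}
```

Callers simply use `sum_floats(data, count)`; the rest of the program can stay compiled for SSE2.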

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
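
To make the in-lane behaviour concrete, here is a small plain-C model (no intrinsics needed) of what the 256-bit form of UNPCKLPS, i.e. `_mm256_unpacklo_ps`, does to its elements. The function name `unpacklo_ps_256_model` is made up for this sketch; it only mirrors the element movement and is not how one would write real SIMD code.

```c
#include <stddef.h>

/* Scalar model of 256-bit VUNPCKLPS (_mm256_unpacklo_ps).
   The 128-bit unpack-low pattern is applied separately to each
   128-bit lane (4 floats); no element ever crosses between lanes. */
void unpacklo_ps_256_model(const float a[8], const float b[8], float out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        const int base = lane * 4;      /* first element of this lane */
        out[base + 0] = a[base + 0];
        out[base + 1] = b[base + 0];
        out[base + 2] = a[base + 1];
        out[base + 3] = b[base + 1];
    }
}
```

With a = {0,1,2,3,4,5,6,7} and b = {10,11,12,13,14,15,16,17}, the model produces {0,10,1,11, 4,14,5,15}. Treating the register as one long vector would instead suggest {0,10,1,11,2,12,3,13}, which is not what the instruction does.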

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.


1252 questions
0 votes, 1 answer

Loading 128 bits of mixed float+int data?

I have a struct which has the following composition: static constexpr uint64_t emptyStructValue { 0 }; union MyStruct { explicit MyStruct(uint64_t comp) : composite(comp){} struct{ int16_t a; bool b; bool c; …
user997112
0 votes, 1 answer

AVX - storing __256 vector back to the memory (void**) in C,

I have the following code extract written in C, double* res; posix_memalign((void **)&res, 32, sizeof(double)*4); __m256 ymm0, ymm1, ymm2, ymm3; ymm0 = _mm256_load_pd(vector_a); ymm1 = _mm256_load_pd(vector_b); ymm2 =…
lukieleetronic
0 votes, 0 answers

AVX _mm256_sin_ps missing on OSX i7 AVX2 Retina MacBook Pro

The Intel Intrinsics Guide lists _mm256_sin_ps as an available function with the header immintrin.h and the AVX flag, yet it seems to be missing from Xcode / OSX. I do have an AVX2 machine and other AVX intrinsics work fine. Am I missing something…
bitwise
0 votes, 1 answer

SSE Sum of multiplication of 4 32-bit integers

Thanks to this post I found out how to multiply 4 32-bit integers. What I want to do now is sum up the results. How can I do this using intrinsics? I've got access to SSE, SSE2 and AVX. My initial thoughts were to unload res into an int array and…
Harrold
0 votes, 1 answer

Most efficient way to test a 256-bit YMM AVX register element for equal or less than zero

I'm implementing a particle system using Intel AVX intrinsics. When the Y-position of a particle is less than or equal to zero I want to reset the particle. The particle system is ordered in a SOA-pattern like this: class ParticleSystem { …
SvinSimpe
0 votes, 1 answer

AVX equivalent for _mm_storeu_ps?

I have quite a fast AVX code, but it's just one single function using AVX, the rest of the huge project is on SSE2, so I do NOT want to set architecture to AVX. At the end of each iteration I need to convert the 4 doubles in one YMM register to 4…
mrzacek mrzacek
0 votes, 0 answers

AVX assembler loop gets slowed down 3x by vunpcklpd instruction

I'm fighting with optimizing this loop using AVX (excerpt only, NASM syntax): .repete: vmulpd ymm4, ymm1, ymm2 vhaddpd ymm5, ymm4, ymm4 vextractf128 xmm6, ymm5, 1 vaddsd xmm5, xmm5, xmm6 vcvtss2sd xmm7, [MSI + MCX * 4] vmulsd xmm3, xmm7,…
mrzacek mrzacek
0 votes, 1 answer

prefetching pd (4 double) into __m256d register

I want to prefetch some data using AVX. I was checking the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) but it only lists _mm_prefetch(...) for SSE. Does anyone know a workaround for AVX? Update…
LeoW.
0 votes, 2 answers

Why is Julia asking for AVX instructions on Ubuntu 14.04?

On my Ubuntu 14.04 box, Julia is complaining that my machine doesn't support AVX instructions. What may be the reason for this?
Naren Yellavula
0 votes, 1 answer

Equal zero instruction in SSE

Suppose I have a 128-bit integer vector: __m128i x; How can I tell whether all the bits in x are zero? Checking every packed integer is a simple approach, but I'm looking for a faster way. Is there any instruction in SSE that can do this job?
KUN
0 votes, 1 answer

Segmentation Fault when using vmovupd

I am trying to push four floating-point numbers at a time onto the stack and then transfer them into a ymm (AVX) register. A friend of mine is working on the same project and our code seems identical, but I'm getting a core dump when I call vmovupd ymm0,…
wmurmann
0 votes, 0 answers

Subtract content of vector from scalar

I am trying to optimize my code for different SIMD architectures. What is the best way to calculate the following: For SSE: float s = something __m128 v = calculation result s -= v[0] + v[1] + v[2] + v[3] At the moment I calculate the horizontal sum…
Maik
0 votes, 0 answers

VEXTRACTF128 versus VEXTRACTI128

As far as I can tell, the VEXTRACTF128 and VEXTRACTI128 instructions do the same thing, have the same latency and throughput, and use the same ports. The only difference I can tell between them is that VEXTRACTF128 only requires AVX VEXTRACTI128…
Z boson
0 votes, 0 answers

process 8-bit int with AVX

Long story short, I've been trying to learn a new programming paradigm and get out of my comfort zone, going from someone who just writes code to someone who actually understands what's going on behind the scenes. I've read several assembly…
Ray Renaldi
0 votes, 0 answers

Optimized build for AMD Piledriver arch-Unreal Engine 4

To begin with, in order to use Unreal Engine 4 you have to build it using Visual Studio 2013. In other words, you are able to tune the compiler settings to improve overall performance. However, I am a bit of a newbie when it comes down to…