Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new VEX encoding for all previous Intel SSE instructions, giving 3-operand non-destructive forms where the destination is separate from the two sources. It also introduces double-width 256-bit ymm vector registers, and some new instructions for manipulating them. The floating-point vector instructions have 256-bit versions in AVX, but 256-bit integer instructions require AVX2. AVX2 also introduced lane-crossing shuffles with per-element granularity (e.g. vpermps / vpermd).
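
As a minimal sketch of what this looks like with intrinsics (the function names are just illustrative): the 256-bit float add below needs only AVX, while the 256-bit integer add needs AVX2. Either way, each intrinsic leaves its source operands unmodified, matching the 3-operand VEX forms.

    #include <immintrin.h>

    // AVX: 256-bit packed-single add (vaddps ymm, ymm, ymm)
    __m256 add_floats(__m256 a, __m256 b) {
        return _mm256_add_ps(a, b);        // a and b are not overwritten
    }

    // AVX2: 256-bit packed 32-bit integer add (vpaddd ymm, ymm, ymm)
    __m256i add_ints(__m256i a, __m256i b) {
        return _mm256_add_epi32(a, b);
    }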

Mixing VEX-encoded AVX instructions and legacy-encoded (non-VEX) SSE instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance penalty from state transitions. This has led to several performance questions where that was the answer.
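
Compilers with AVX enabled normally insert vzeroupper automatically before calls and returns, so the pitfall mostly bites hand-written asm or code calling into separately built non-AVX libraries. A sketch of the intrinsic form, with an illustrative routine name:

    #include <stddef.h>
    #include <immintrin.h>

    void scale(float *dst, const float *src, size_t n, float factor) {
        __m256 vf = _mm256_set1_ps(factor);
        for (size_t i = 0; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);
            _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, vf));
        }
        // Zero the upper halves of the ymm registers before any
        // legacy-SSE-encoded code gets a chance to run.
        _mm256_zeroupper();
        for (size_t i = n & ~(size_t)7; i < n; ++i)    // scalar tail
            dst[i] = src[i] * factor;
    }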

Another pitfall for beginners is that most 256-bit instructions operate in-lane, on two separate 128-bit halves, rather than treating a ymm register as one 256-bit-long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
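
A small illustration of the in-lane behaviour, using _mm256_unpacklo_ps; the commented output is what the documented per-lane operation produces, not the full 8-element interleave one might expect:

    #include <stdio.h>
    #include <immintrin.h>

    int main(void) {
        __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
        __m256 b = _mm256_setr_ps(10, 11, 12, 13, 14, 15, 16, 17);
        // Unpacks the low half of each 128-bit lane separately:
        // lane 0 -> a0 b0 a1 b1, lane 1 -> a4 b4 a5 b5
        __m256 r = _mm256_unpacklo_ps(a, b);
        float out[8];
        _mm256_storeu_ps(out, r);
        for (int i = 0; i < 8; i++) printf("%g ", out[i]);
        printf("\n");    // prints: 0 10 1 11 4 14 5 15
        return 0;
    }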

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.


Interesting Q&As / FAQs (from the tag's 1252 questions):

34 votes, 4 answers
What's missing/sub-optimal in this memcpy implementation?
I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's some guy's implementation: __forceinline // Since Size is usually known, //…
asked by einpoklum

32 votes, 1 answer
What are the best instruction sequences to generate vector constants on the fly?
"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count. Constant-generation is by its very nature the start of a fresh dependency chain,…
asked by Peter Cordes

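As a taste of the tricks involved (not necessarily the best sequence for any particular constant), two classic idioms: all-zeros from an xor-with-self, and all-ones from a compare-with-self.

    #include <immintrin.h>

    // All-zeros: vpxor of a register with itself; recognized as a zeroing
    // idiom independent of the old value, so it starts a fresh dep chain.
    __m256i zeros(void) { return _mm256_setzero_si256(); }

    // All-ones: vpcmpeqd same,same is true for every element regardless of
    // the (undefined) input.  The ymm integer compare needs AVX2.
    __m256i ones(void) {
        __m256i x = _mm256_undefined_si256();
        return _mm256_cmpeq_epi32(x, x);
    }
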
32 votes, 3 answers
Intel AVX: 256-bit version of dot product for double precision floating point variables
The Intel Advanced Vector Extensions (AVX) offer no dot product in the 256-bit version (YMM register) for double precision floating point variables. The "Why?" question has been very briefly treated in another forum (here) and on Stack Overflow…
asked by gleeen.gould

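For reference, the missing operation is usually built from an ordinary multiply followed by a horizontal sum; a minimal sketch for a 4-element double-precision dot product:

    #include <immintrin.h>

    // Dot product of two __m256d vectors (4 doubles each).
    static inline double dot4(__m256d a, __m256d b) {
        __m256d p  = _mm256_mul_pd(a, b);               // element-wise products
        __m256d h  = _mm256_hadd_pd(p, p);              // in-lane: (p0+p1, p0+p1, p2+p3, p2+p3)
        __m128d lo = _mm256_castpd256_pd128(h);         // p0+p1
        __m128d hi = _mm256_extractf128_pd(h, 1);       // p2+p3
        return _mm_cvtsd_f64(_mm_add_sd(lo, hi));       // total
    }
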
31 votes, 3 answers
Is there a version of TensorFlow not compiled for AVX instructions?
I'm trying to get TensorFlow up on my Chromebook, not the best place, I know, but I just want to get a feel for it. I haven't done much work in the Python dev environment, or in any dev environment for that matter, so bear with me. After figuring…
asked by bobe

30 votes, 3 answers
Why is gcc so much worse at std::vector vectorization of a conditional multiply than clang?
Consider the following float loop, compiled using -O3 -mavx2 -mfma: for (auto i = 0; i < a.size(); ++i) { a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0; } Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and understands the…
asked by Vladislav Kogan

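For context, the branchless code a vectorizer should reach is roughly this hand-written equivalent (a sketch that assumes plain float arrays and a count that is a multiple of 8):

    #include <stddef.h>
    #include <immintrin.h>

    // a[i] = (b[i] > c[i]) ? b[i]*c[i] : 0.0f, eight elements per iteration.
    void cond_mul(float *a, const float *b, const float *c, size_t n) {
        for (size_t i = 0; i < n; i += 8) {
            __m256 vb   = _mm256_loadu_ps(b + i);
            __m256 vc   = _mm256_loadu_ps(c + i);
            __m256 prod = _mm256_mul_ps(vb, vc);
            __m256 mask = _mm256_cmp_ps(vb, vc, _CMP_GT_OQ);     // all-ones where b > c
            _mm256_storeu_ps(a + i, _mm256_and_ps(prod, mask));  // zero elsewhere
        }
    }
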
30 votes, 3 answers
How to write portable simd code for complex multiplicative reduction
I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is: #include <complex.h> complex float f(complex float x[], int n) { complex float p = 1.0; for (int i = 0; i < n; i++) p *=…
asked by Simd

28 votes, 3 answers
How can I exchange the low 128 bits and high 128 bits in a 256 bit AVX (YMM) register
I am porting SSE SIMD code to use the 256 bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits. The backing story: What I really want is VHADDPS/_mm256_hadd_ps to act like…
asked by Mark Borgerding

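For the swap itself, the 128-bit-granularity permute that AVX does provide is enough; a minimal sketch:

    #include <immintrin.h>

    // Swap the low and high 128-bit halves of a ymm register.
    static inline __m256 swap_halves(__m256 v) {
        return _mm256_permute2f128_ps(v, v, 0x01);   // vperm2f128: high half then low half
    }
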
27 votes, 3 answers
How to efficiently perform double/int64 conversions with SSE/AVX?
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers: _mm_cvtps_epi32(), _mm_cvtepi32_ps(). But there are no equivalents for double-precision and 64-bit integers. In other words, they are…
asked by plasmacel

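Before AVX-512 added direct 64-bit conversions, a well-known workaround was the exponent-bias trick. A sketch for unsigned values known to fit in 52 bits; note that the double-to-integer direction rounds to nearest rather than truncating, and the 256-bit integer ops need AVX2:

    #include <immintrin.h>

    // uint64 -> double, valid for inputs < 2^52.
    static inline __m256d u64_to_f64(__m256i v) {
        const __m256d magic = _mm256_set1_pd(4503599627370496.0);        // 2^52
        __m256i bits = _mm256_or_si256(v, _mm256_castpd_si256(magic));   // put v in the mantissa of 2^52
        return _mm256_sub_pd(_mm256_castsi256_pd(bits), magic);
    }

    // double -> uint64, valid for inputs in [0, 2^52).
    static inline __m256i f64_to_u64(__m256d v) {
        const __m256d magic = _mm256_set1_pd(4503599627370496.0);        // 2^52
        __m256d biased = _mm256_add_pd(v, magic);                        // mantissa now holds the integer
        return _mm256_xor_si256(_mm256_castpd_si256(biased),
                                _mm256_castpd_si256(magic));             // strip the 2^52 part
    }
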
26 votes, 2 answers
Why is it faster to perform float by float matrix multiplication compared to int by int?
Having two int matrices A and B, with more than 1000 rows and 10K columns, I often need to convert them to float matrices to gain speedup (4x or more). I'm wondering why this is the case. I realize that there is a lot of optimization and…
asked by NULL

26 votes, 2 answers
How are the gather instructions in AVX2 implemented?
Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded exists in different cache lines? Is the instruction implemented as a hardware loop which fetches…
asked by Anuj Kalia

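For readers who haven't used it, this is the intrinsic form of the instruction being asked about; a minimal sketch:

    #include <immintrin.h>

    // Gather 8 floats: result[i] = base[idx[i]].  The last argument is the
    // byte scale applied to each index (4 for a float array).
    static inline __m256 gather8(const float *base, __m256i idx) {
        return _mm256_i32gather_ps(base, idx, 4);
    }
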
26 votes, 5 answers
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
The intrinsic: int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform…
asked by Satya Arjunan

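One known approach is to broadcast the mask, route the right mask byte to every vector byte with an in-lane byte shuffle, and then test each byte's bit; a sketch (AVX2 required):

    #include <stdint.h>
    #include <immintrin.h>

    // Expand a 32-bit mask so that byte i of the result is 0xFF if bit i
    // of the mask is set, and 0x00 otherwise.
    static inline __m256i inverse_movemask_epi8(uint32_t mask) {
        __m256i v = _mm256_set1_epi32((int)mask);               // the mask in every dword
        // Per-byte selector: bytes 0-7 read mask byte 0, bytes 8-15 read byte 1;
        // vpshufb is in-lane, so the high lane reads mask bytes 2 and 3.
        const __m256i shuf = _mm256_setr_epi64x(0x0000000000000000LL,
                                                0x0101010101010101LL,
                                                0x0202020202020202LL,
                                                0x0303030303030303LL);
        v = _mm256_shuffle_epi8(v, shuf);
        // Byte k of each 8-byte group still needs bit k isolated: set every
        // other bit, then a byte is 0xFF exactly when its mask bit was set.
        v = _mm256_or_si256(v, _mm256_set1_epi64x(0x7fbfdfeff7fbfdfeLL));
        return _mm256_cmpeq_epi8(v, _mm256_set1_epi8((char)0xFF));
    }
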
25 votes, 1 answer
Will it be feasible to use gcc's function multi-versioning without code changes?
According to most benchmarks, Intel's Clear Linux is way faster than other distributions, mostly thanks to a GCC feature called Function Multi-Versioning. Right now the method they use is to compile the code, analyze which function contains…
asked by Alexander

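As background, GCC's function multi-versioning normally does need a source-level marker; a minimal sketch of the target_clones attribute, which makes GCC build an AVX2 clone alongside a baseline one and dispatch at run time:

    // Compile with gcc; dispatch happens once, via an ifunc resolver.
    __attribute__((target_clones("avx2", "default")))
    void add_arrays(float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];        // auto-vectorized with AVX2 in that clone
    }
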
23 votes, 5 answers
How to use AVX/pclmulqdq on Mac OS X
I am trying to compile a program that uses the pclmulqdq instruction present in new Intel processors. I've installed GCC 4.6 using macports but when I compile my program (which uses the intrinsic _mm_clmulepi64_si128), I…
asked by Conrado

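For reference, the intrinsic in question is declared in wmmintrin.h (also pulled in by immintrin.h) and needs the matching target flag, e.g. -mpclmul with GCC; a minimal sketch:

    #include <wmmintrin.h>

    // Carry-less multiply of the low 64-bit halves of a and b.
    __m128i clmul_lo(__m128i a, __m128i b) {
        return _mm_clmulepi64_si128(a, b, 0x00);
    }
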
23 votes, 3 answers
Fastest way to do horizontal vector sum with AVX instructions
I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following: v_sum = _mm_hadd_ps(v_sum, v_sum); v_sum = _mm_hadd_ps(v_sum,…
asked by Luigi Castelli

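The usual pattern is to reduce to 128 bits first and finish there; a sketch for a vector of four doubles:

    #include <immintrin.h>

    // Horizontal sum of the four doubles in v.
    static inline double hsum4(__m256d v) {
        __m128d lo = _mm256_castpd256_pd128(v);     // v0, v1
        __m128d hi = _mm256_extractf128_pd(v, 1);   // v2, v3
        lo = _mm_add_pd(lo, hi);                    // v0+v2, v1+v3
        return _mm_cvtsd_f64(_mm_hadd_pd(lo, lo));  // (v0+v2)+(v1+v3)
    }
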
23 votes, 2 answers
Fastest way to multiply an array of int64_t?
I want to vectorize the multiplication of two memory-aligned arrays. I didn't find any way to do a 64×64-bit multiply in AVX/AVX2, so I just did loop unrolling and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the…

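A packed 64-bit multiply instruction (vpmullq) only arrived with AVX-512DQ, but the low 64 bits of each product can be assembled from 32-bit partial products; a sketch of that decomposition:

    #include <immintrin.h>

    // Low 64 bits of a[i]*b[i] for four packed 64-bit integers (AVX2).
    static inline __m256i mul64_lo(__m256i a, __m256i b) {
        __m256i a_hi = _mm256_srli_epi64(a, 32);               // high 32-bit halves
        __m256i b_hi = _mm256_srli_epi64(b, 32);
        __m256i lo   = _mm256_mul_epu32(a, b);                 // a_lo * b_lo
        __m256i m1   = _mm256_mul_epu32(a_hi, b);              // a_hi * b_lo
        __m256i m2   = _mm256_mul_epu32(a, b_hi);              // a_lo * b_hi
        __m256i mid  = _mm256_add_epi64(m1, m2);               // cross terms
        return _mm256_add_epi64(lo, _mm256_slli_epi64(mid, 32));
    }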