Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new VEX encoding for all previous Intel SSE instructions, giving 3-operand non-destructive forms where the destination is separate from the two sources. It also introduces double-width 256-bit ymm vector registers, and some new instructions for manipulating them. The floating-point vector instructions have 256-bit versions in AVX, but 256-bit integer instructions require AVX2. AVX2 also introduced lane-crossing shuffles with per-element granularity (e.g. vpermps / vpermd).
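
As a minimal sketch of what this looks like with intrinsics (the function names are just illustrative): the 256-bit float add below needs only AVX, while the 256-bit integer add needs AVX2. Either way, each intrinsic leaves its source operands unmodified, matching the 3-operand VEX forms.

    #include <immintrin.h>

    // AVX: 256-bit packed-single add (vaddps ymm, ymm, ymm)
    __m256 add_floats(__m256 a, __m256 b) {
        return _mm256_add_ps(a, b);        // a and b are not overwritten
    }

    // AVX2: 256-bit packed 32-bit integer add (vpaddd ymm, ymm, ymm)
    __m256i add_ints(__m256i a, __m256i b) {
        return _mm256_add_epi32(a, b);
    }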

Mixing VEX-encoded AVX instructions and legacy-encoded (non-VEX) SSE instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance penalty from state transitions. This has led to several performance questions where that was the answer.
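
Compilers with AVX enabled normally insert vzeroupper automatically before calls and returns, so the pitfall mostly bites hand-written asm or code calling into separately built non-AVX libraries. A sketch of the intrinsic form, with an illustrative routine name:

    #include <stddef.h>
    #include <immintrin.h>

    void scale(float *dst, const float *src, size_t n, float factor) {
        __m256 vf = _mm256_set1_ps(factor);
        for (size_t i = 0; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);
            _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, vf));
        }
        // Zero the upper halves of the ymm registers before any
        // legacy-SSE-encoded code gets a chance to run.
        _mm256_zeroupper();
        for (size_t i = n & ~(size_t)7; i < n; ++i)    // scalar tail
            dst[i] = src[i] * factor;
    }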

Another pitfall for beginners is that most 256-bit instructions operate in-lane, on two separate 128-bit halves, rather than treating a ymm register as one 256-bit-long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
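
A small illustration of the in-lane behaviour, using _mm256_unpacklo_ps; the commented output is what the documented per-lane operation produces, not the full 8-element interleave one might expect:

    #include <stdio.h>
    #include <immintrin.h>

    int main(void) {
        __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
        __m256 b = _mm256_setr_ps(10, 11, 12, 13, 14, 15, 16, 17);
        // Unpacks the low half of each 128-bit lane separately:
        // lane 0 -> a0 b0 a1 b1, lane 1 -> a4 b4 a5 b5
        __m256 r = _mm256_unpacklo_ps(a, b);
        float out[8];
        _mm256_storeu_ps(out, r);
        for (int i = 0; i < 8; i++) printf("%g ", out[i]);
        printf("\n");    // prints: 0 10 1 11 4 14 5 15
        return 0;
    }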

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.


Interesting Q&As / FAQs (from the tag's 1252 questions):

34 votes, 4 answers
What's missing/sub-optimal in this memcpy implementation?
I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's some guy's implementation: __forceinline // Since Size is usually known, //…
asked by einpoklum

32 votes, 1 answer
What are the best instruction sequences to generate vector constants on the fly?
"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count. Constant-generation is by its very nature the start of a fresh dependency chain,…
asked by Peter Cordes

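As a taste of the tricks involved (not necessarily the best sequence for any particular constant), two classic idioms: all-zeros from an xor-with-self, and all-ones from a compare-with-self.

    #include <immintrin.h>

    // All-zeros: vpxor of a register with itself; recognized as a zeroing
    // idiom independent of the old value, so it starts a fresh dep chain.
    __m256i zeros(void) { return _mm256_setzero_si256(); }

    // All-ones: vpcmpeqd same,same is true for every element regardless of
    // the (undefined) input.  The ymm integer compare needs AVX2.
    __m256i ones(void) {
        __m256i x = _mm256_undefined_si256();
        return _mm256_cmpeq_epi32(x, x);
    }
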
32 votes, 3 answers
Intel AVX: 256-bit version of dot product for double precision floating point variables
The Intel Advanced Vector Extensions (AVX) offer no dot product in the 256-bit version (YMM register) for double precision floating point variables. The "Why?" question has been very briefly treated in another forum (here) and on Stack Overflow…
asked by gleeen.gould

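For reference, the missing operation is usually built from an ordinary multiply followed by a horizontal sum; a minimal sketch for a 4-element double-precision dot product:

    #include <immintrin.h>

    // Dot product of two __m256d vectors (4 doubles each).
    static inline double dot4(__m256d a, __m256d b) {
        __m256d p  = _mm256_mul_pd(a, b);               // element-wise products
        __m256d h  = _mm256_hadd_pd(p, p);              // in-lane: (p0+p1, p0+p1, p2+p3, p2+p3)
        __m128d lo = _mm256_castpd256_pd128(h);         // p0+p1
        __m128d hi = _mm256_extractf128_pd(h, 1);       // p2+p3
        return _mm_cvtsd_f64(_mm_add_sd(lo, hi));       // total
    }
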
31 votes, 3 answers
Is there a version of TensorFlow not compiled for AVX instructions?
I'm trying to get TensorFlow up on my Chromebook, not the best place, I know, but I just want to get a feel for it. I haven't done much work in the Python dev environment, or in any dev environment for that matter, so bear with me. After figuring…
asked by bobe

30 votes, 3 answers
Why is gcc so much worse at std::vector vectorization of a conditional multiply than clang?
Consider the following float loop, compiled using -O3 -mavx2 -mfma: for (auto i = 0; i < a.size(); ++i) { a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0; } Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and understands the…
asked by Vladislav Kogan

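For context, the branchless code a vectorizer should reach is roughly this hand-written equivalent (a sketch that assumes plain float arrays and a count that is a multiple of 8):

    #include <stddef.h>
    #include <immintrin.h>

    // a[i] = (b[i] > c[i]) ? b[i]*c[i] : 0.0f, eight elements per iteration.
    void cond_mul(float *a, const float *b, const float *c, size_t n) {
        for (size_t i = 0; i < n; i += 8) {
            __m256 vb   = _mm256_loadu_ps(b + i);
            __m256 vc   = _mm256_loadu_ps(c + i);
            __m256 prod = _mm256_mul_ps(vb, vc);
            __m256 mask = _mm256_cmp_ps(vb, vc, _CMP_GT_OQ);     // all-ones where b > c
            _mm256_storeu_ps(a + i, _mm256_and_ps(prod, mask));  // zero elsewhere
        }
    }
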
30 votes, 3 answers
How to write portable simd code for complex multiplicative reduction
I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is: #include <complex.h> complex float f(complex float x[], int n) { complex float p = 1.0; for (int i = 0; i < n; i++) p *=…
asked by Simd

28 votes, 3 answers
How can I exchange the low 128 bits and high 128 bits in a 256 bit AVX (YMM) register
I am porting SSE SIMD code to use the 256 bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits. The backing story: What I really want is VHADDPS/_mm256_hadd_ps to act like…
asked by Mark Borgerding

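For the swap itself, the 128-bit-granularity permute that AVX does provide is enough; a minimal sketch:

    #include <immintrin.h>

    // Swap the low and high 128-bit halves of a ymm register.
    static inline __m256 swap_halves(__m256 v) {
        return _mm256_permute2f128_ps(v, v, 0x01);   // vperm2f128: high half then low half
    }
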
27 votes, 3 answers
How to efficiently perform double/int64 conversions with SSE/AVX?
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers: _mm_cvtps_epi32(), _mm_cvtepi32_ps(). But there are no equivalents for double-precision and 64-bit integers. In other words, they are…
asked by plasmacel

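Before AVX-512 added direct 64-bit conversions, a well-known workaround was the exponent-bias trick. A sketch for unsigned values known to fit in 52 bits; note that the double-to-integer direction rounds to nearest rather than truncating, and the 256-bit integer ops need AVX2:

    #include <immintrin.h>

    // uint64 -> double, valid for inputs < 2^52.
    static inline __m256d u64_to_f64(__m256i v) {
        const __m256d magic = _mm256_set1_pd(4503599627370496.0);        // 2^52
        __m256i bits = _mm256_or_si256(v, _mm256_castpd_si256(magic));   // put v in the mantissa of 2^52
        return _mm256_sub_pd(_mm256_castsi256_pd(bits), magic);
    }

    // double -> uint64, valid for inputs in [0, 2^52).
    static inline __m256i f64_to_u64(__m256d v) {
        const __m256d magic = _mm256_set1_pd(4503599627370496.0);        // 2^52
        __m256d biased = _mm256_add_pd(v, magic);                        // mantissa now holds the integer
        return _mm256_xor_si256(_mm256_castpd_si256(biased),
                                _mm256_castpd_si256(magic));             // strip the 2^52 part
    }
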
26 votes, 2 answers
Why is it faster to perform float by float matrix multiplication compared to int by int?
Having two int matrices A and B, with more than 1000 rows and 10K columns, I often need to convert them to float matrices to gain speedup (4x or more). I'm wondering why this is the case. I realize that there is a lot of optimization and…
asked by NULL

26 votes, 2 answers
How are the gather instructions in AVX2 implemented?
Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded exists in different cache lines? Is the instruction implemented as a hardware loop which fetches…
asked by Anuj Kalia

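For readers who haven't used it, this is the intrinsic form of the instruction being asked about; a minimal sketch:

    #include <immintrin.h>

    // Gather 8 floats: result[i] = base[idx[i]].  The last argument is the
    // byte scale applied to each index (4 for a float array).
    static inline __m256 gather8(const float *base, __m256i idx) {
        return _mm256_i32gather_ps(base, idx, 4);
    }
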
26 votes, 5 answers
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
The intrinsic: int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform…
asked by Satya Arjunan

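One known approach is to broadcast the mask, route the right mask byte to every vector byte with an in-lane byte shuffle, and then test each byte's bit; a sketch (AVX2 required):

    #include <stdint.h>
    #include <immintrin.h>

    // Expand a 32-bit mask so that byte i of the result is 0xFF if bit i
    // of the mask is set, and 0x00 otherwise.
    static inline __m256i inverse_movemask_epi8(uint32_t mask) {
        __m256i v = _mm256_set1_epi32((int)mask);               // the mask in every dword
        // Per-byte selector: bytes 0-7 read mask byte 0, bytes 8-15 read byte 1;
        // vpshufb is in-lane, so the high lane reads mask bytes 2 and 3.
        const __m256i shuf = _mm256_setr_epi64x(0x0000000000000000LL,
                                                0x0101010101010101LL,
                                                0x0202020202020202LL,
                                                0x0303030303030303LL);
        v = _mm256_shuffle_epi8(v, shuf);
        // Byte k of each 8-byte group still needs bit k isolated: set every
        // other bit, then a byte is 0xFF exactly when its mask bit was set.
        v = _mm256_or_si256(v, _mm256_set1_epi64x(0x7fbfdfeff7fbfdfeLL));
        return _mm256_cmpeq_epi8(v, _mm256_set1_epi8((char)0xFF));
    }
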
25 votes, 1 answer
Will it be feasible to use gcc's function multi-versioning without code changes?
According to most benchmarks, Intel's Clear Linux is way faster than other distributions, mostly thanks to a GCC feature called Function Multi-Versioning. Right now the method they use is to compile the code, analyze which function contains…
asked by Alexander

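As background, GCC's function multi-versioning normally does need a source-level marker; a minimal sketch of the target_clones attribute, which makes GCC build an AVX2 clone alongside a baseline one and dispatch at run time:

    // Compile with gcc; dispatch happens once, via an ifunc resolver.
    __attribute__((target_clones("avx2", "default")))
    void add_arrays(float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];        // auto-vectorized with AVX2 in that clone
    }
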
23 votes, 5 answers
How to use AVX/pclmulqdq on Mac OS X
I am trying to compile a program that uses the pclmulqdq instruction present in new Intel processors. I've installed GCC 4.6 using macports but when I compile my program (which uses the intrinsic _mm_clmulepi64_si128), I…
asked by Conrado

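For reference, the intrinsic in question is declared in wmmintrin.h (also pulled in by immintrin.h) and needs the matching target flag, e.g. -mpclmul with GCC; a minimal sketch:

    #include <wmmintrin.h>

    // Carry-less multiply of the low 64-bit halves of a and b.
    __m128i clmul_lo(__m128i a, __m128i b) {
        return _mm_clmulepi64_si128(a, b, 0x00);
    }
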
23 votes, 3 answers
Fastest way to do horizontal vector sum with AVX instructions
I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following: v_sum = _mm_hadd_ps(v_sum, v_sum); v_sum = _mm_hadd_ps(v_sum,…
asked by Luigi Castelli

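The usual pattern is to reduce to 128 bits first and finish there; a sketch for a vector of four doubles:

    #include <immintrin.h>

    // Horizontal sum of the four doubles in v.
    static inline double hsum4(__m256d v) {
        __m128d lo = _mm256_castpd256_pd128(v);     // v0, v1
        __m128d hi = _mm256_extractf128_pd(v, 1);   // v2, v3
        lo = _mm_add_pd(lo, hi);                    // v0+v2, v1+v3
        return _mm_cvtsd_f64(_mm_hadd_pd(lo, lo));  // (v0+v2)+(v1+v3)
    }
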
23 votes, 2 answers
Fastest way to multiply an array of int64_t?
I want to vectorize the multiplication of two memory-aligned arrays. I didn't find any way to do a 64×64-bit multiply in AVX/AVX2, so I just did loop unrolling and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the…

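A packed 64-bit multiply instruction (vpmullq) only arrived with AVX-512DQ, but the low 64 bits of each product can be assembled from 32-bit partial products; a sketch of that decomposition:

    #include <immintrin.h>

    // Low 64 bits of a[i]*b[i] for four packed 64-bit integers (AVX2).
    static inline __m256i mul64_lo(__m256i a, __m256i b) {
        __m256i a_hi = _mm256_srli_epi64(a, 32);               // high 32-bit halves
        __m256i b_hi = _mm256_srli_epi64(b, 32);
        __m256i lo   = _mm256_mul_epu32(a, b);                 // a_lo * b_lo
        __m256i m1   = _mm256_mul_epu32(a_hi, b);              // a_hi * b_lo
        __m256i m2   = _mm256_mul_epu32(a, b_hi);              // a_lo * b_hi
        __m256i mid  = _mm256_add_epi64(m1, m2);               // cross terms
        return _mm256_add_epi64(lo, _mm256_slli_epi64(mid, 32));
    }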