Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (vex-encoded) and non-AVX (old SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance problem. This has led to several performance questions where this was the answer.

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.


Interesting Q&As / FAQs:

1252 questions
23
votes
2 answers

FMA3 in GCC: how to enable

I have a i5-4250U which has AVX2 and FMA3. I am testing some dense matrix multiplication code in GCC 4.8.1 on Linux which I wrote. Below is a list of three difference ways I compile. SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp AVX:…
Z boson
  • 32,619
  • 11
  • 123
  • 226
22
votes
6 answers

How to use the Intel AVX in Java?

How do I use the Intel AVX vector instruction set from Java? It's a simple question but the answer seems to be hard to find.
Albert Hendriks
  • 1,979
  • 3
  • 25
  • 45
22
votes
5 answers

Transpose an 8x8 float using AVX/AVX2

Transposing a 8x8 matrix can be achieved by making four 4x4 matrices, and transposing each of them. This is not want I'm going for. In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However,…
DavidS
  • 1,660
  • 1
  • 12
  • 26
22
votes
5 answers

How to combine two __m128 values to __m256?

I would like to combine two __m128 values to one __m256. Something like this: __m128 a = _mm_set_ps(1, 2, 3, 4); __m128 b = _mm_set_ps(5, 6, 7, 8); to something like: __m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 }; are there any intrinsics that I can…
user1468756
  • 331
  • 2
  • 8
21
votes
2 answers

Choice between aligned vs. unaligned x86 SIMD instructions

There are generally two types of SIMD instructions: A. Ones that work with aligned memory addresses, that will raise general-protection (#GP) exception if the address is not aligned on the operand size boundary: movaps xmm0, xmmword ptr…
MikeF
  • 1,021
  • 9
  • 29
21
votes
3 answers

How to solve the 32-byte-alignment issue for AVX load/store operations?

I am having alignment issue while using ymm registers, with some snippets of code that seems fine to me. Here is a minimal working example: #include #include inline void ones(float *a) { __m256 out_aligned =…
romeric
  • 2,325
  • 3
  • 19
  • 35
21
votes
2 answers

Measuring memory bandwidth from the dot product of two arrays

The dot product of two arrays for(int i=0; i
Z boson
  • 32,619
  • 11
  • 123
  • 226
20
votes
5 answers

Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

Modern x86_64 linux with glibc will detect that CPU has support of AVX extension and will switch many string functions from generic implementation to AVX-optimized version (with help of ifunc dispatchers: 1, 2). This feature can be good for…
osgx
  • 90,338
  • 53
  • 357
  • 513
20
votes
5 answers

SSE-copy, AVX-copy and std::copy performance

I'm tried to improve performance of copy operation via SSE and AVX: #include const int sz = 1024; float *mas = (float *)_mm_malloc(sz*sizeof(float), 16); float *tar = (float *)_mm_malloc(sz*sizeof(float), 16); …
gorill
  • 1,623
  • 3
  • 20
  • 29
20
votes
2 answers

How to sum __m256 horizontally?

I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use _mm_hadd_ps(xmm,xmm); _mm_hadd_ps(xmm,xmm); to get the result at the first component of the vector, but this does not scale with the 256…
Yoav
  • 5,962
  • 5
  • 39
  • 61
19
votes
4 answers

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value; My attempts all ended up using a lot of shuffling of the…
Luigi Castelli
  • 676
  • 2
  • 6
  • 13
19
votes
2 answers

Half-precision floating-point arithmetic on Intel chips

Is it possible to perform half-precision floating-point arithmetic on Intel chips? I know how to load/store/convert half-precision floating-point numbers [1] but I do not know how to add/multiply them without converting to single-precision…
Kadir
  • 1,345
  • 3
  • 15
  • 25
19
votes
1 answer

Is it worth bothering to align AVX-256 memory stores?

According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, section B.4 ("Performance Tuning Techniques for Intel® Microarchitecture Code Name Sandy Bridge"), subsection B.4.5.2 ("Assists"): 32-byte AVX store instructions that…
Maxim Masiutin
  • 3,991
  • 4
  • 55
  • 72
19
votes
2 answers

How to rotate an SSE/AVX vector

I need to perform a rotate operation with as little clock cycles as possible. In the first case let's assume __m128i as source and dest type: source: || A0 || A1 || A2 || A3 || dest: || A1 || A2 || A3 || A0 || dest =…
user1584773
  • 699
  • 7
  • 19
18
votes
2 answers

Reference manual/tutorial for x86 SIMD intrinsics?

I'm looking into using these to improve the performance of some code but good documentation seems hard to find for the functions defined in the *mmintrin.h headers, can anybody provide me with pointers to good info on these? EDIT: particularly…
BD at Rivenhill
  • 12,395
  • 10
  • 46
  • 49
1 2
3
83 84