Questions tagged [avx512]

AVX512 is Intel's next generation of SIMD instructions. It widens vectors to 512 bits and adds new functionality (masking) and more vector registers.

AVX512 is a set of instruction set extensions for x86 that features 512-bit SIMD vectors.

Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

AVX512 is split into sub-extensions, including the following. Every AVX512 implementation is required to support AVX512-F; the rest are optional.

  • AVX512-F (Foundation)
  • AVX512-CD (Conflict Detection)
  • AVX512-ER (Exponential and Reciprocal)
  • AVX512-PF (Prefetch)
  • AVX512-BW (Byte and Word instructions)
  • AVX512-DQ (Double-word and quad-word instructions)
  • AVX512-VL (Vector Length)
  • AVX512-IFMA (52-bit Integer Multiply-Add)
  • AVX512-VBMI (Vector Byte-Manipulation)
  • AVX512-VPOPCNT (Vector Population Count)
  • AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
  • AVX512-4VNNIW (4 x Neural Network Instructions)
  • AVX512-VBMI2 (Vector Byte-Manipulation 2)
  • AVX512-VNNI (Vector Neural Network Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-VAES (Vector AES Instructions)
  • AVX512-VGFI (Galois Field Arithmetic)
  • AVX512-VPCLMULQ (Vector Carry-less Multiply)

Supporting Processors:

  • Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
  • Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNT, 4FMAPS, 4VNNIW)
  • Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
  • Intel Cannonlake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
  • Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNT, VBMI2, VNNI, BITALG, VAES, VGFI, VPCLMULQ)
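
To check which sub-extensions a given machine reports, one option on Linux is to look at the `flags` line of `/proc/cpuinfo`, where the kernel lists names such as `avx512f` and `avx512cd`. The following is a minimal sketch of filtering such a line; the function name `avx512_flags` and the sample line are made up for illustration.

```python
def avx512_flags(flags_line):
    """Return the sorted AVX512 feature flags found in a cpuinfo 'flags' line."""
    # Accept either the full "flags : ..." line or just the bare flag list.
    _, _, tail = flags_line.partition(":")
    tokens = (tail if tail else flags_line).split()
    return sorted(t for t in tokens if t.startswith("avx512"))

# Illustrative input only (not copied from a real machine).
sample = "flags : fpu sse2 avx2 avx512f avx512cd avx512bw avx512dq avx512vl"
print(avx512_flags(sample))
```

On a real system the input would be the `flags` line read from `/proc/cpuinfo`; other operating systems expose this information differently.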

Foundation (AVX512-F):

All implementations of AVX512 are required to support AVX512-F. AVX512-F expands AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides embedded masking by means of 8 opmask registers.

AVX512-F only supports operations on 32-bit and 64-bit elements, and only operates on zmm (512-bit) registers.
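
As a rough illustration of what embedded masking means, here is a scalar Python model of a masked 32-bit add over the 16 lanes of one zmm register. It is a sketch of the semantics only, not how the hardware is implemented; the function name is hypothetical, loosely mirroring the `_mm512_mask_add_epi32` / `_mm512_maskz_add_epi32` intrinsics.

```python
MASK32 = 0xFFFFFFFF

def masked_add_epi32(src, k, a, b, zeroing=False):
    """Model a masked lane-wise 32-bit add over 16 lanes (one zmm register)."""
    assert len(src) == len(a) == len(b) == 16
    out = []
    for i in range(16):
        if (k >> i) & 1:
            out.append((a[i] + b[i]) & MASK32)  # lane selected: wrapping add
        elif zeroing:
            out.append(0)                       # zero-masking clears the lane
        else:
            out.append(src[i])                  # merge-masking keeps the old value
    return out

a = list(range(16))
b = [100] * 16
old = [7] * 16
merged = masked_add_epi32(old, 0x000F, a, b)                # low 4 lanes updated
zeroed = masked_add_epi32(old, 0x000F, a, b, zeroing=True)  # other lanes cleared
```

Bit `i` of the opmask `k` controls lane `i`. With merge-masking the unselected lanes keep the destination's previous contents, while zero-masking sets them to zero.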

Conflict Detection (AVX512-CD):

AVX512-CD aids vectorization by providing instructions to detect data conflicts, such as duplicate indices within a vector that would collide in a scatter.
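
A scalar sketch of what the conflict-detection instruction for 32-bit elements (`vpconflictd`) computes, as I understand it: for each element, a bitmask of the earlier elements that hold the same value. The function name is hypothetical, and the model accepts any list length even though the real instruction works on a fixed number of lanes.

```python
def conflict_epi32(v):
    """For each element, return a bitmask of lower-indexed elements equal to it."""
    out = []
    for i, x in enumerate(v):
        bits = 0
        for j in range(i):          # only compare against earlier lanes
            if v[j] == x:
                bits |= 1 << j
        out.append(bits)
    return out

indices = [3, 5, 3, 7, 5, 3]
print(conflict_epi32(indices))  # [0, 0, 1, 0, 2, 5]
```

A lane whose result is zero has no earlier duplicate, so a loop that scatters through these indices can process all zero-result lanes together and handle the rest in a later pass.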

Exponential and Reciprocal (AVX512-ER):

AVX512-ER provides instructions for computing reciprocal, reciprocal square root, and base-2 exponential approximations with increased accuracy. These are used to aid in the fast computation of transcendental functions.

Prefetch (AVX512-PF):

AVX512-PF provides instructions for vector gather/scatter prefetching.

Byte and Word (AVX512-BW):

AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.

Double-word and Quad-word (AVX512-DQ):

AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.

Vector-Length (AVX512-VL):

AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm and ymm registers (as opposed to only zmm). This includes the masking as well as the increased register count of 32.

52-bit Integer Multiply-Add (AVX512-IFMA):

AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (Speculation: likely derived from the floating-point FMA hardware)
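
A scalar sketch of one 64-bit lane of the two IFMA operations (`vpmadd52luq` and `vpmadd52huq`), under my reading of their semantics: the low 52 bits of each input are multiplied into a 104-bit product, and either the low or the high 52 bits of that product are added to a 64-bit accumulator. Function names are hypothetical.

```python
M52 = (1 << 52) - 1
M64 = (1 << 64) - 1

def madd52lo(acc, a, b):
    """acc + low 52 bits of (a[51:0] * b[51:0]), wrapping to 64 bits."""
    prod = (a & M52) * (b & M52)
    return (acc + (prod & M52)) & M64

def madd52hi(acc, a, b):
    """acc + high 52 bits of the 104-bit product, wrapping to 64 bits."""
    prod = (a & M52) * (b & M52)
    return (acc + (prod >> 52)) & M64

a = (1 << 51) + 3
b = 4
lo = madd52lo(0, a, b)   # 12
hi = madd52hi(0, a, b)   # 2
```

Combining both halves as `lo + (hi << 52)` recovers the full product, which is why the pair is convenient for multi-precision arithmetic with 52-bit limbs.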

Vector Byte-Manipulation (AVX512-VBMI):

AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.
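
A scalar sketch of a full-width byte permute (the behaviour of `vpermb` on a 512-bit register, as far as I know): each output byte is chosen by the low 6 bits of the corresponding index byte. The function name is hypothetical, modelled on the `_mm512_permutexvar_epi8` intrinsic.

```python
def permutexvar_epi8(idx, src):
    """dst[i] = src[idx[i] & 63] for the 64 bytes of a 512-bit vector."""
    assert len(idx) == len(src) == 64
    return [src[i & 63] for i in idx]

src = [(3 * i + 1) & 0xFF for i in range(64)]
reverse = list(range(63, -1, -1))
reversed_bytes = permutexvar_epi8(reverse, src)  # byte-reverses the vector
```

Unlike the older in-lane byte shuffles, the index here can select any of the 64 source bytes, so the permute crosses 128-bit lane boundaries.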

Vector Population Count (AVX512-VPOPCNT):

A vectorized version of the popcnt instruction for 32-bit and 64-bit elements. (Intel's documentation names this extension AVX512-VPOPCNTDQ.)
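
A scalar sketch of the per-lane population count: every element is replaced by the number of set bits it contains. The function name and the `bits` parameter are hypothetical.

```python
def popcnt_lanes(v, bits=64):
    """Count the set bits in each element, treating it as a `bits`-wide lane."""
    mask = (1 << bits) - 1
    return [bin(x & mask).count("1") for x in v]

print(popcnt_lanes([0, 1, 0xFF, (1 << 64) - 1]))  # [0, 1, 8, 64]
```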

4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS):

AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.

4 x Neural Network Instructions (AVX512-4VNNIW):

Specialized instructions on 16-bit integers for Neural Networks. These follow the same "4 consecutive" op instruction format as AVX512-4FMAPS.

Vector Byte-Manipulation 2 (AVX512-VBMI2):

Extends AVX512-VBMI by adding support for compress/expand on 8-bit and 16-bit elements.
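
A scalar sketch of compress and expand in their zero-masking form: compress packs the mask-selected elements into the low lanes, and expand does the inverse by spreading the low lanes out to the mask-selected positions. The function names are hypothetical, and the model ignores element width since the logic is the same for bytes and words.

```python
def compress(k, v):
    """Pack elements whose mask bit is set into the low lanes; zero the rest."""
    picked = [x for i, x in enumerate(v) if (k >> i) & 1]
    return picked + [0] * (len(v) - len(picked))

def expand(k, v):
    """Place consecutive low elements of v into lanes whose mask bit is set."""
    out, nxt = [], 0
    for i in range(len(v)):
        if (k >> i) & 1:
            out.append(v[nxt])
            nxt += 1
        else:
            out.append(0)
    return out

v = [10, 11, 12, 13, 14, 15, 16, 17]
k = 0b10100101                      # lanes 0, 2, 5 and 7 selected
packed = compress(k, v)             # [10, 12, 15, 17, 0, 0, 0, 0]
spread = expand(k, packed)          # [10, 0, 12, 0, 0, 15, 0, 17]
```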

Vector Neural Network Instructions (AVX512-VNNI):

Specialized instructions for Neural Networks. This is the desktop/Xeon version of AVX512-4VNNIW on Knights Mill Xeon Phi.
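
To make this concrete, here is a scalar sketch of one 32-bit lane of the non-saturating byte dot product (`vpdpbusd`), under my reading of its semantics: four unsigned bytes are multiplied with four signed bytes and the products are summed into a 32-bit accumulator. The function name is hypothetical.

```python
def dpbusd_lane(acc, a_bytes, b_bytes):
    """acc + sum(unsigned a[j] * signed b[j]) for j in 0..3, wrapping to 32 bits."""
    assert len(a_bytes) == len(b_bytes) == 4
    total = acc
    for ua, sb in zip(a_bytes, b_bytes):
        assert 0 <= ua <= 255 and -128 <= sb <= 127
        total += ua * sb
    total &= 0xFFFFFFFF                       # wrap like a 32-bit register
    return total - (1 << 32) if total & 0x80000000 else total

print(dpbusd_lane(10, [1, 2, 3, 4], [1, -1, 2, -2]))  # 7
```

This is the inner step of an 8-bit quantized dot product, which is where the "neural network" name comes from.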

Bit Algorithms (AVX512-BITALG):

Extends AVX512-VPOPCNT to 8-bit and 16-bit elements. Adds additional bit manipulation instructions.

Vector AES Instructions (AVX512-VAES):

Extends the existing AES-NI instructions to 512-bit width.

Galois Field Arithmetic (AVX512-VGFI):

Instructions for byte-wise arithmetic in the Galois field GF(2^8). (Intel's documentation names this extension GFNI.)
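
A scalar sketch of a single byte multiply in GF(2^8), assuming the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B, the same one AES uses), which is what I understand the `gf2p8mulb` instruction to apply to each byte. The function name is hypothetical.

```python
def gf2p8mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a        # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1                # multiply a by x
        if a & 0x100:
            a ^= 0x11B         # reduce when the degree reaches 8
    return result

print(hex(gf2p8mul(0x57, 0x83)))  # 0xc1
```

The value 0x57 * 0x83 = 0xC1 is the worked example from the AES specification (FIPS-197), which makes it a convenient sanity check.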

Vector Carry-less Multiply (AVX512-VPCLMULQ):

Vectorized version of the pclmulqdq instruction. (Intel's documentation names this extension VPCLMULQDQ.)
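
A scalar sketch of the underlying operation, a 64 x 64 -> 128-bit carry-less multiply: it is ordinary long multiplication with XOR in place of addition, so no carries propagate. The function name is hypothetical; as far as I know, the vector form applies this independently within each 128-bit lane, with an immediate selecting which 64-bit half of each operand to use.

```python
M64 = (1 << 64) - 1

def clmul64(a, b):
    """Carry-less (polynomial over GF(2)) product of two 64-bit values."""
    a &= M64
    b &= M64
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i   # XOR in the shifted partial product
    return result

print(clmul64(0b11, 0b11))  # 5, since (x + 1)^2 = x^2 + 1 over GF(2)
```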

349 questions
-1
votes
1 answer

avx slower then sse multimedia extensions

I am programming a perfect program to parallelize with multimedia extensions. The program consists of transforming an image, so i go over a matrix and i modify each pixel inside it. For go over faster, i use multimedia extensions: At first i used…
-2
votes
1 answer

For what Intel's AVX-512 has 32 (so many!) 512-bit register vectors, ZMM0 to ZMM31?

Intel's AVX512 technology supports parallelization due to multiple subregisters, e.g. there are 8 64-bit FP-subregisters in each 512-bit vector register. And what, the multiple vector registers may operate in parallel as well? Does the following…
-3
votes
0 answers

How to load 64B data from memory to a register in x86-64 assembly?

I want to load data from memory to a register in x86-64 assembly. If %rcx saves the memory address and %rax is the destination register, I can do a load in 8B/16B/32B granularity using the following instructions: mov (%%rcx), %%rax <= 8B…