Questions tagged [avx512]

AVX512 is Intel's family of 512-bit SIMD instruction-set extensions: it widens vectors to 512 bits, adds new functionality such as per-element masking, and doubles the number of vector registers to 32.

AVX512 is a set of instruction set extensions for x86 that features 512-bit SIMD vectors.

Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

AVX512 is broken into sub-extensions, including the following. All AVX512 implementations are required to support AVX512-F; the rest are optional.

  • AVX512-F (Foundation)
  • AVX512-CD (Conflict Detection)
  • AVX512-ER (Exponential and Reciprocal)
  • AVX512-PF (Prefetch)
  • AVX512-BW (Byte and Word instructions)
  • AVX512-DQ (Double-word and quad-word instructions)
  • AVX512-VL (Vector Length)
  • AVX512-IFMA (52-bit Integer Multiply-Add)
  • AVX512-VBMI (Vector Byte-Manipulation)
  • AVX512-VPOPCNTDQ (Vector Population Count)
  • AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
  • AVX512-4VNNIW (4 x Neural Network Instructions)
  • AVX512-VBMI2 (Vector Byte-Manipulation 2)
  • AVX512-VNNI (Vector Neural Network Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • VAES (Vector AES Instructions)
  • GFNI (Galois Field New Instructions)
  • VPCLMULQDQ (Vector Carry-less Multiply)

Supporting Processors:

  • Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
  • Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNTDQ, 4FMAPS, 4VNNIW)
  • Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
  • Intel Cannonlake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
  • Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNTDQ, VBMI2, VNNI, BITALG), plus VAES, GFNI and VPCLMULQDQ

Foundation (AVX512-F):

All implementations of AVX512 are required to support AVX512-F. AVX512-F expands AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides embedded masking by means of 8 opmask registers (k0-k7).

AVX512-F only supports operations on 32-bit and 64-bit elements and only operates on the full-width zmm (512-bit) registers.
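As a rough illustration (the function and variable names here are made up for the example), merge-masking with AVX512-F intrinsics looks like this in C, compiled with e.g. -mavx512f on GCC/Clang:

    #include <immintrin.h>

    /* Add a and b only in the 32-bit lanes where a > 0; the other lanes
       keep the value from src (merge-masking). The _mm512_maskz_* forms
       zero the unselected lanes instead. */
    __m512i add_where_positive(__m512i src, __m512i a, __m512i b)
    {
        __mmask16 k = _mm512_cmpgt_epi32_mask(a, _mm512_setzero_si512());
        return _mm512_mask_add_epi32(src, k, a, b);
    }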

Conflict Detection (AVX512-CD):

AVX512-CD aids vectorization by providing instructions that detect data conflicts within a vector, such as duplicate indices in a gather/modify/scatter loop.
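A small sketch (illustrative function name) of the core instruction, VPCONFLICTD, via intrinsics, compiled with -mavx512cd:

    #include <immintrin.h>

    /* For each 32-bit lane, returns a bitmap of the earlier lanes that
       hold the same value, so duplicate indices in a gather/modify/scatter
       loop can be detected and handled. */
    __m512i detect_duplicate_indices(__m512i indices)
    {
        return _mm512_conflict_epi32(indices);
    }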

Exponential and Reciprocal (AVX512-ER):

AVX512-ER provides instructions for computing approximations of the reciprocal, reciprocal square root and base-2 exponential functions with higher accuracy (up to about 28 bits) than the earlier approximation instructions. These can be used as building blocks for fast transcendental functions.
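For example (illustrative wrapper, Xeon Phi only, compiled with -mavx512er):

    #include <immintrin.h>

    /* 28-bit-accurate reciprocal and 23-bit-accurate 2^x approximations. */
    __m512 rcp_then_exp2(__m512 x)
    {
        __m512 r = _mm512_rcp28_ps(x);   /* ~2^-28 relative error */
        return _mm512_exp2a23_ps(r);     /* 2^r, ~23-bit accuracy */
    }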

Prefetch (AVX512-PF):

AVX512-PF provides instructions for vector gather/scatter prefetching.

Byte and Word (AVX512-BW):

AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.
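For instance (illustrative wrapper, -mavx512bw), byte-granular compares produce a 64-bit mask, one bit per byte lane:

    #include <immintrin.h>

    /* Compare 64 byte lanes for equality; bit i of the result is set
       when byte i of a equals byte i of b. */
    __mmask64 equal_bytes(__m512i a, __m512i b)
    {
        return _mm512_cmpeq_epi8_mask(a, b);
    }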

Double-word and Quad-word (AVX512-DQ):

AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.
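One example of a DQ instruction (illustrative wrapper, -mavx512dq) is the full 64-bit integer multiply VPMULLQ:

    #include <immintrin.h>

    /* Per-lane 64-bit x 64-bit multiply, keeping the low 64 bits. */
    __m512i mul_low64(__m512i a, __m512i b)
    {
        return _mm512_mullo_epi64(a, b);
    }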

Vector-Length (AVX512-VL):

AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm (128-bit) and ymm (256-bit) registers instead of only zmm. This includes masking as well as access to all 32 vector registers.
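A short sketch (illustrative wrapper, -mavx512f -mavx512vl): the same masked operations become available on 256-bit ymm vectors:

    #include <immintrin.h>

    /* Masked 32-bit add on a 256-bit vector; unselected lanes keep src. */
    __m256i masked_add_256(__m256i src, __mmask8 k, __m256i a, __m256i b)
    {
        return _mm256_mask_add_epi32(src, k, a, b);
    }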

52-bit Integer Multiply-Add (AVX512-IFMA):

AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (The 52-bit width matches the mantissa of a double-precision float, so these instructions likely reuse the floating-point FMA hardware.)
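Sketch of the low-half instruction via intrinsics (illustrative wrapper, -mavx512ifma):

    #include <immintrin.h>

    /* Treat each 64-bit lane of a and b as an unsigned 52-bit integer,
       multiply them to a 104-bit product, and add the low 52 bits of the
       product to the accumulator acc. _mm512_madd52hi_epu64 gives the
       high half. */
    __m512i mac52lo(__m512i acc, __m512i a, __m512i b)
    {
        return _mm512_madd52lo_epu64(acc, a, b);
    }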

Vector Byte-Manipulation (AVX512-VBMI):

AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.
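For example, VPERMB is a full 64-byte table lookup (illustrative wrapper, -mavx512vbmi):

    #include <immintrin.h>

    /* Each byte of idx selects one of the 64 bytes of table. */
    __m512i lookup_bytes(__m512i idx, __m512i table)
    {
        return _mm512_permutexvar_epi8(idx, table);
    }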

Vector Population Count (AVX512-VPOPCNTDQ)

A vectorized version of the popcnt instruction for 32-bit and 64-bit words.
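Illustrative wrapper (-mavx512vpopcntdq):

    #include <immintrin.h>

    /* Count the set bits in each 64-bit lane (VPOPCNTQ); a 32-bit
       per-lane version (_mm512_popcnt_epi32 / VPOPCNTD) also exists. */
    __m512i popcount_per_qword(__m512i v)
    {
        return _mm512_popcnt_epi64(v);
    }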

4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS)

AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.

Neural Network Instructions (AVX512-4VNNIW)

Specialized instructions on 16-bit integers for Neural Networks. These follow the same "4 consecutive" op instruction format as AVX512-4FMAPS.

Vector Byte-Manipulation 2 (AVX512-VBMI2)

Extends AVX512-VBMI by adding compress/expand support at byte and word (8/16-bit) granularity.
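A sketch of byte-granular compress (illustrative wrapper, -mavx512vbmi2), which is essentially a left-packing operation:

    #include <immintrin.h>

    /* Keep only the bytes whose mask bit is set, packed towards the low
       end of the vector, and zero the remaining bytes. */
    __m512i left_pack_bytes(__mmask64 keep, __m512i v)
    {
        return _mm512_maskz_compress_epi8(keep, v);
    }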

Neural Network Instructions (AVX512-VNNI)

Specialized instructions for neural networks: integer dot-product instructions that multiply 8-bit or 16-bit elements and accumulate into 32-bit lanes. This is the desktop/Xeon counterpart of AVX512-4VNNIW on the Knights Mill Xeon Phi.
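Sketch of the central instruction, VPDPBUSD (illustrative wrapper, -mavx512vnni):

    #include <immintrin.h>

    /* Multiply unsigned bytes of a with signed bytes of b, sum each group
       of four adjacent products, and accumulate into the 32-bit lanes of
       acc. */
    __m512i dot_u8s8_accumulate(__m512i acc, __m512i a, __m512i b)
    {
        return _mm512_dpbusd_epi32(acc, a, b);
    }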

Bit Algorithms (AVX512-BITALG)

Extends vector population count (AVX512-VPOPCNTDQ) to 8-bit and 16-bit elements, and adds additional bit-manipulation instructions.
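Illustrative wrapper (-mavx512bitalg):

    #include <immintrin.h>

    /* Per-byte population count (VPOPCNTB); _mm512_popcnt_epi16 covers
       16-bit lanes. */
    __m512i popcount_per_byte(__m512i v)
    {
        return _mm512_popcnt_epi8(v);
    }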

Vector AES Instructions (VAES)

Extends the existing AES-NI instructions to operate on 256-bit and 512-bit vectors, i.e. several independent 128-bit AES blocks per instruction.
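Illustrative wrapper (requires VAES together with the AVX-512 compiler flags enabled):

    #include <immintrin.h>

    /* One AES encryption round on each of the four independent 128-bit
       blocks packed into a 512-bit vector. */
    __m512i aes_round_x4(__m512i state, __m512i round_key)
    {
        return _mm512_aesenc_epi128(state, round_key);
    }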

Galois Field New Instructions (GFNI)

Instructions for arithmetic in the Galois field GF(2^8), such as affine transformations and multiplication, useful in cryptography and error-correcting codes.
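Illustrative wrapper (requires GFNI together with the AVX-512 compiler flags enabled):

    #include <immintrin.h>

    /* Per-byte multiplication in GF(2^8) using the AES reduction
       polynomial (GF2P8MULB). */
    __m512i gf_mul_bytes(__m512i a, __m512i b)
    {
        return _mm512_gf2p8mul_epi8(a, b);
    }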

Vector Carry-less Multiply (VPCLMULQDQ)

A vectorized (256-bit and 512-bit) version of the pclmulqdq carry-less multiply instruction.
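Illustrative wrapper (requires VPCLMULQDQ together with the AVX-512 compiler flags enabled):

    #include <immintrin.h>

    /* Carry-less multiply of the low 64-bit halves within each 128-bit
       lane; the immediate selects which halves to multiply. */
    __m512i clmul_low_x4(__m512i a, __m512i b)
    {
        return _mm512_clmulepi64_epi128(a, b, 0x00);
    }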

349 questions
8 votes, 2 answers

AVX-512 and Branching

I'm confused as to what masking can do in theory in relation to branches. Let's say I have a Skylake-SP (ha, I wish..), and we're ignoring compiler capabilities, just what's possible in theory: If a branch conditional is dependent on a static flag,…
Michel Müller
  • 5,535
  • 3
  • 31
  • 49
8 votes, 2 answers

How can I write a QuadWord from AVX512 register zmm26 to the rax register?

I wish to perform integer arithmetic operations on Quad Word elements of the zmm 0-31 register set and preserve the carry bit resulting from those operations. It appears this is only possible if the data were worked on in the general register…
jgr2015
  • 81
  • 3
8 votes, 3 answers

Horizontal add with __m512 (AVX512)

How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (ie add the items from a single vector together)? For 128 and 256 bit registers this can be done using _mm_hadd_ps and _mm256_hadd_ps but there is no…
Rouslan
  • 93
  • 1
  • 6
7 votes, 1 answer

Why is transforming an array using AVX-512 instructions significantly slower when transforming it in batches of 8 compared to 7 or 9?

Please consider the following minimal example minimal.cpp (https://godbolt.org/z/x7dYes91M). #include #include #include #include #include #include #define NUMBER_OF_TUPLES…
7 votes, 1 answer

Missing AVX-512 intrinsics for masks?

Intel's intrinsics guide lists a number of intrinsics for the AVX-512 K* mask instructions, but there seem to be a few missing: KSHIFT{L/R} KADD KTEST The Intel developer manual claims that intrinsics are not necessary as they are auto generated…
zinga
  • 769
  • 7
  • 17
7 votes, 0 answers

AVX512 log2 or pow instructions

I need an AVX512 double pow(double, int n) function (I need it for a binomial distribution calculation which needs to be exact). In particular I would like this for Knights Landing, which has AVX512ER. One way to get this is x^n =…
Z boson
  • 32,619
  • 11
  • 123
  • 226
6 votes, 1 answer

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, it is not the best…
thnghh
  • 81
  • 1
  • 8
6 votes, 1 answer

How to emulate _mm256_loadu_epi32 with gcc or clang?

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32: __m256i _mm256_loadu_epi32 (void const* mem_addr); /* Instruction: vmovdqu32 ymm, m256 CPUID Flags: AVX512VL + AVX512F Description Load 256-bits (composed of 8 packed…
Walter
  • 44,150
  • 20
  • 113
  • 196
6 votes, 2 answers

BMI for generating masks with AVX512

I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the clean up loop after the loop could be removed using the AVX512 mask operations. Here is the code I am…
Z boson
  • 32,619
  • 11
  • 123
  • 226
6 votes, 1 answer

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

The Intel intrinsics guide states simply that _mm512_load_epi32: Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst and that _mm512_load_si512: Load[s] 512-bits of integer data from memory into dst What is the…
Qix - MONICA WAS MISTREATED
  • 14,451
  • 16
  • 82
  • 145
6 votes, 2 answers

invalid register for .seh_savexmm in Cygwin

$ make. I have worked with Cygwin but got a compile error. I am not sure what "invalid register for .seh_savexmm" means, please help me. I searched for this problem on Google and found many similar problems but no solution. perl…
X zheng
  • 1,731
  • 1
  • 17
  • 25
6 votes, 1 answer

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly. I'm writing software that operates on bytes and words…
user1649948
  • 651
  • 4
  • 12
6 votes, 1 answer

Embedded broadcasts with intrinsics and assembly

In section 2.5.3 "Broadcasts" of the Intel Architecture Instruction Set Extensions Programming Reference we learn that AVX512 (and Knights Corner) has a bit-field to encode data broadcast for some load-op instructions, i.e. instructions that…
Z boson
  • 32,619
  • 11
  • 123
  • 226
6 votes, 2 answers

Why doesn't Intel design its SIMD ISAs in a more compatible or universal way?

Intel has several SIMD ISAs, such as SSE, AVX, AVX2, AVX-512 and IMCI on Xeon Phi. These ISAs are supported on different processors. For example, AVX-512 BW, AVX-512 DQ and AVX-512 VL are only supported on Skylake, but not on Xeon Phi. AVX-512F,…
thierry
  • 217
  • 2
  • 12
6 votes, 2 answers

What is meant by "fixing up" floats?

I was looking through the instruction set in AVX-512 and noticed a set of fixup instructions. Some examples: _mm512_fixupimm_pd, _mm512_mask_fixupimm_pd, _mm512_maskz_fixupimm_pd _mm512_fixupimm_round_pd, _mm512_mask_fixupimm_round_pd,…
Simon Verbeke
  • 2,905
  • 8
  • 36
  • 55