Questions tagged [avx512]

AVX512 is Intel's family of 512-bit SIMD instruction-set extensions: it widens vectors to 512 bits, adds new functionality such as per-element masking, and doubles the number of vector registers to 32.

AVX512 is a set of instruction set extensions for x86 that features 512-bit SIMD vectors.

Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

AVX512 is broken into sub-extensions, including the following. All AVX512 implementations are required to support AVX512-F; the rest are optional.

  • AVX512-F (Foundation)
  • AVX512-CD (Conflict Detection)
  • AVX512-ER (Exponential and Reciprocal)
  • AVX512-PF (Prefetch)
  • AVX512-BW (Byte and Word instructions)
  • AVX512-DQ (Double-word and quad-word instructions)
  • AVX512-VL (Vector Length)
  • AVX512-IFMA (52-bit Integer Multiply-Add)
  • AVX512-VBMI (Vector Byte-Manipulation)
  • AVX512-VPOPCNTDQ (Vector Population Count)
  • AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
  • AVX512-4VNNIW (4 x Neural Network Instructions)
  • AVX512-VBMI2 (Vector Byte-Manipulation 2)
  • AVX512-VNNI (Vector Neural Network Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • VAES (Vector AES Instructions)
  • GFNI (Galois Field New Instructions)
  • VPCLMULQDQ (Vector Carry-less Multiply)

Supporting Processors:

  • Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
  • Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNTDQ, 4FMAPS, 4VNNIW)
  • Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
  • Intel Cannonlake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
  • Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNTDQ, VBMI2, VNNI, BITALG), plus VAES, GFNI and VPCLMULQDQ

Foundation (AVX512-F):

All implementations of AVX512 are required to support AVX512-F. AVX512-F expands AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides embedded masking by means of 8 opmask registers (k0-k7).

AVX512-F only supports operations on 32-bit and 64-bit elements and only operates on the full-width zmm (512-bit) registers.
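As a rough illustration (the function and variable names here are made up for the example), merge-masking with AVX512-F intrinsics looks like this in C, compiled with e.g. -mavx512f on GCC/Clang:

    #include <immintrin.h>

    /* Add a and b only in the 32-bit lanes where a > 0; the other lanes
       keep the value from src (merge-masking). The _mm512_maskz_* forms
       zero the unselected lanes instead. */
    __m512i add_where_positive(__m512i src, __m512i a, __m512i b)
    {
        __mmask16 k = _mm512_cmpgt_epi32_mask(a, _mm512_setzero_si512());
        return _mm512_mask_add_epi32(src, k, a, b);
    }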

Conflict Detection (AVX512-CD):

AVX512-CD aids vectorization by providing instructions that detect data conflicts within a vector, such as duplicate indices in a gather/modify/scatter loop.
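A small sketch (illustrative function name) of the core instruction, VPCONFLICTD, via intrinsics, compiled with -mavx512cd:

    #include <immintrin.h>

    /* For each 32-bit lane, returns a bitmap of the earlier lanes that
       hold the same value, so duplicate indices in a gather/modify/scatter
       loop can be detected and handled. */
    __m512i detect_duplicate_indices(__m512i indices)
    {
        return _mm512_conflict_epi32(indices);
    }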

Exponential and Reciprocal (AVX512-ER):

AVX512-ER provides instructions for computing approximations of the reciprocal, reciprocal square root and base-2 exponential functions with higher accuracy (up to about 28 bits) than the earlier approximation instructions. These can be used as building blocks for fast transcendental functions.
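For example (illustrative wrapper, Xeon Phi only, compiled with -mavx512er):

    #include <immintrin.h>

    /* 28-bit-accurate reciprocal and 23-bit-accurate 2^x approximations. */
    __m512 rcp_then_exp2(__m512 x)
    {
        __m512 r = _mm512_rcp28_ps(x);   /* ~2^-28 relative error */
        return _mm512_exp2a23_ps(r);     /* 2^r, ~23-bit accuracy */
    }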

Prefetch (AVX512-PF):

AVX512-PF provides instructions for vector gather/scatter prefetching.

Byte and Word (AVX512-BW):

AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.
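For instance (illustrative wrapper, -mavx512bw), byte-granular compares produce a 64-bit mask, one bit per byte lane:

    #include <immintrin.h>

    /* Compare 64 byte lanes for equality; bit i of the result is set
       when byte i of a equals byte i of b. */
    __mmask64 equal_bytes(__m512i a, __m512i b)
    {
        return _mm512_cmpeq_epi8_mask(a, b);
    }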

Double-word and Quad-word (AVX512-DQ):

AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.
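One example of a DQ instruction (illustrative wrapper, -mavx512dq) is the full 64-bit integer multiply VPMULLQ:

    #include <immintrin.h>

    /* Per-lane 64-bit x 64-bit multiply, keeping the low 64 bits. */
    __m512i mul_low64(__m512i a, __m512i b)
    {
        return _mm512_mullo_epi64(a, b);
    }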

Vector-Length (AVX512-VL):

AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm (128-bit) and ymm (256-bit) registers instead of only zmm. This includes masking as well as access to all 32 vector registers.
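A short sketch (illustrative wrapper, -mavx512f -mavx512vl): the same masked operations become available on 256-bit ymm vectors:

    #include <immintrin.h>

    /* Masked 32-bit add on a 256-bit vector; unselected lanes keep src. */
    __m256i masked_add_256(__m256i src, __mmask8 k, __m256i a, __m256i b)
    {
        return _mm256_mask_add_epi32(src, k, a, b);
    }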

52-bit Integer Multiply-Add (AVX512-IFMA):

AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (The 52-bit width matches the mantissa of a double-precision float, so these instructions likely reuse the floating-point FMA hardware.)
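Sketch of the low-half instruction via intrinsics (illustrative wrapper, -mavx512ifma):

    #include <immintrin.h>

    /* Treat each 64-bit lane of a and b as an unsigned 52-bit integer,
       multiply them to a 104-bit product, and add the low 52 bits of the
       product to the accumulator acc. _mm512_madd52hi_epu64 gives the
       high half. */
    __m512i mac52lo(__m512i acc, __m512i a, __m512i b)
    {
        return _mm512_madd52lo_epu64(acc, a, b);
    }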

Vector Byte-Manipulation (AVX512-VBMI):

AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.
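For example, VPERMB is a full 64-byte table lookup (illustrative wrapper, -mavx512vbmi):

    #include <immintrin.h>

    /* Each byte of idx selects one of the 64 bytes of table. */
    __m512i lookup_bytes(__m512i idx, __m512i table)
    {
        return _mm512_permutexvar_epi8(idx, table);
    }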

Vector Population Count (AVX512-VPOPCNTDQ)

A vectorized version of the popcnt instruction for 32-bit and 64-bit words.
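Illustrative wrapper (-mavx512vpopcntdq):

    #include <immintrin.h>

    /* Count the set bits in each 64-bit lane (VPOPCNTQ); a 32-bit
       per-lane version (_mm512_popcnt_epi32 / VPOPCNTD) also exists. */
    __m512i popcount_per_qword(__m512i v)
    {
        return _mm512_popcnt_epi64(v);
    }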

4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS)

AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.

Neural Network Instructions (AVX512-4VNNIW)

Specialized instructions on 16-bit integers for Neural Networks. These follow the same "4 consecutive" op instruction format as AVX512-4FMAPS.

Vector Byte-Manipulation 2 (AVX512-VBMI2)

Extends AVX512-VBMI by adding compress/expand support at byte and word (8/16-bit) granularity.
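A sketch of byte-granular compress (illustrative wrapper, -mavx512vbmi2), which is essentially a left-packing operation:

    #include <immintrin.h>

    /* Keep only the bytes whose mask bit is set, packed towards the low
       end of the vector, and zero the remaining bytes. */
    __m512i left_pack_bytes(__mmask64 keep, __m512i v)
    {
        return _mm512_maskz_compress_epi8(keep, v);
    }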

Neural Network Instructions (AVX512-VNNI)

Specialized instructions for neural networks: integer dot-product instructions that multiply 8-bit or 16-bit elements and accumulate into 32-bit lanes. This is the desktop/Xeon counterpart of AVX512-4VNNIW on the Knights Mill Xeon Phi.
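Sketch of the central instruction, VPDPBUSD (illustrative wrapper, -mavx512vnni):

    #include <immintrin.h>

    /* Multiply unsigned bytes of a with signed bytes of b, sum each group
       of four adjacent products, and accumulate into the 32-bit lanes of
       acc. */
    __m512i dot_u8s8_accumulate(__m512i acc, __m512i a, __m512i b)
    {
        return _mm512_dpbusd_epi32(acc, a, b);
    }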

Bit Algorithms (AVX512-BITALG)

Extends vector population count (AVX512-VPOPCNTDQ) to 8-bit and 16-bit elements, and adds additional bit-manipulation instructions.
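Illustrative wrapper (-mavx512bitalg):

    #include <immintrin.h>

    /* Per-byte population count (VPOPCNTB); _mm512_popcnt_epi16 covers
       16-bit lanes. */
    __m512i popcount_per_byte(__m512i v)
    {
        return _mm512_popcnt_epi8(v);
    }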

Vector AES Instructions (VAES)

Extends the existing AES-NI instructions to operate on 256-bit and 512-bit vectors, i.e. several independent 128-bit AES blocks per instruction.
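Illustrative wrapper (requires VAES together with the AVX-512 compiler flags enabled):

    #include <immintrin.h>

    /* One AES encryption round on each of the four independent 128-bit
       blocks packed into a 512-bit vector. */
    __m512i aes_round_x4(__m512i state, __m512i round_key)
    {
        return _mm512_aesenc_epi128(state, round_key);
    }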

Galois Field New Instructions (GFNI)

Instructions for arithmetic in the Galois field GF(2^8), such as affine transformations and multiplication, useful in cryptography and error-correcting codes.
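Illustrative wrapper (requires GFNI together with the AVX-512 compiler flags enabled):

    #include <immintrin.h>

    /* Per-byte multiplication in GF(2^8) using the AES reduction
       polynomial (GF2P8MULB). */
    __m512i gf_mul_bytes(__m512i a, __m512i b)
    {
        return _mm512_gf2p8mul_epi8(a, b);
    }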

Vector Carry-less Multiply (VPCLMULQDQ)

A vectorized (256-bit and 512-bit) version of the pclmulqdq carry-less multiply instruction.
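Illustrative wrapper (requires VPCLMULQDQ together with the AVX-512 compiler flags enabled):

    #include <immintrin.h>

    /* Carry-less multiply of the low 64-bit halves within each 128-bit
       lane; the immediate selects which halves to multiply. */
    __m512i clmul_low_x4(__m512i a, __m512i b)
    {
        return _mm512_clmulepi64_epi128(a, b, 0x00);
    }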

349 questions
8 votes, 2 answers

AVX-512 and Branching

I'm confused as to what masking can do in theory in relation to branches. Let's say I have a Skylake-SP (ha, I wish..), and we're ignoring compiler capabilities, just what's possible in theory: If a branch conditional is dependent on a static flag,…
Michel Müller
  • 5,535
  • 3
  • 31
  • 49
8 votes, 2 answers

How can I write a QuadWord from AVX512 register zmm26 to the rax register?

I wish to perform integer arithmetic operations on Quad Word elements of the zmm 0-31 register set and preserve the carry bit resulting from those operations. It appears this is only possible if the data were worked on in the general register…
jgr2015
  • 81
  • 3
8 votes, 3 answers

Horizontal add with __m512 (AVX512)

How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (ie add the items from a single vector together)? For 128 and 256 bit registers this can be done using _mm_hadd_ps and _mm256_hadd_ps but there is no…
Rouslan
  • 93
  • 1
  • 6
7 votes, 1 answer

Why is transforming an array using AVX-512 instructions significantly slower when transforming it in batches of 8 compared to 7 or 9?

Please consider the following minimal example minimal.cpp (https://godbolt.org/z/x7dYes91M). #include #include #include #include #include #include #define NUMBER_OF_TUPLES…
7 votes, 1 answer

Missing AVX-512 intrinsics for masks?

Intel's intrinsics guide lists a number of intrinsics for the AVX-512 K* mask instructions, but there seem to be a few missing: KSHIFT{L/R} KADD KTEST The Intel developer manual claims that intrinsics are not necessary as they are auto generated…
zinga
  • 769
  • 7
  • 17
7 votes, 0 answers

AVX512 log2 or pow instructions

I need an AVX512 double pow(double, int n) function (I need it for a binomial distribution calculation which needs to be exact). In particular I would like this for Knights Landing, which has AVX512ER. One way to get this is x^n =…
Z boson
  • 32,619
  • 11
  • 123
  • 226
6 votes, 1 answer

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, it is not the best…
thnghh
  • 81
  • 1
  • 8
6 votes, 1 answer

How to emulate _mm256_loadu_epi32 with gcc or clang?

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32: __m256i _mm256_loadu_epi32 (void const* mem_addr); /* Instruction: vmovdqu32 ymm, m256 CPUID Flags: AVX512VL + AVX512F Description Load 256-bits (composed of 8 packed…
Walter
  • 44,150
  • 20
  • 113
  • 196
6 votes, 2 answers

BMI for generating masks with AVX512

I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the clean up loop after the loop could be removed using the AVX512 mask operations. Here is the code I am…
Z boson
  • 32,619
  • 11
  • 123
  • 226
6 votes, 1 answer

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

The Intel intrinsics guide states simply that _mm512_load_epi32: Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst and that _mm512_load_si512: Load[s] 512-bits of integer data from memory into dst What is the…
Qix - MONICA WAS MISTREATED
  • 14,451
  • 16
  • 82
  • 145
6 votes, 2 answers

invalid register for .seh_savexmm in Cygwin

$ make. I have worked with Cygwin but got a compile error. I am not sure what "invalid register for .seh_savexmm" means, please help me. I searched for this problem on Google and found many similar problems but no solution. perl…
X zheng
  • 1,731
  • 1
  • 17
  • 25
6 votes, 1 answer

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly. I'm writing software that operates on bytes and words…
user1649948
  • 651
  • 4
  • 12
6 votes, 1 answer

Embedded broadcasts with intrinsics and assembly

In section 2.5.3 "Broadcasts" of the Intel Architecture Instruction Set Extensions Programming Reference we learn that AVX512 (and Knights Corner) has a bit-field to encode data broadcast for some load-op instructions, i.e. instructions that…
Z boson
  • 32,619
  • 11
  • 123
  • 226
6 votes, 2 answers

Why doesn't Intel design its SIMD ISAs in a more compatible or universal way?

Intel has several SIMD ISAs, such as SSE, AVX, AVX2, AVX-512 and IMCI on Xeon Phi. These ISAs are supported on different processors. For example, AVX-512 BW, AVX-512 DQ and AVX-512 VL are only supported on Skylake, but not on Xeon Phi. AVX-512F,…
thierry
  • 217
  • 2
  • 12
6 votes, 2 answers

What is meant by "fixing up" floats?

I was looking through the instruction set in AVX-512 and noticed a set of fixup instructions. Some examples: _mm512_fixupimm_pd, _mm512_mask_fixupimm_pd, _mm512_maskz_fixupimm_pd _mm512_fixupimm_round_pd, _mm512_mask_fixupimm_round_pd,…
Simon Verbeke
  • 2,905
  • 8
  • 36
  • 55