Questions tagged [avx512]

AVX512 is a set of x86 SIMD instruction-set extensions, introduced by Intel, that widens vectors to 512 bits, adds new functionality (most notably masking), and doubles the number of vector registers to 32.

Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

AVX512 is broken into the sub-extensions listed below. All AVX512 implementations are required to support AVX512-F; the rest are optional.

  • AVX512-F (Foundation)
  • AVX512-CD (Conflict Detection)
  • AVX512-ER (Exponential and Reciprocal)
  • AVX512-PF (Prefetch)
  • AVX512-BW (Byte and Word instructions)
  • AVX512-DQ (Double-word and quad-word instructions)
  • AVX512-VL (Vector Length)
  • AVX512-IFMA (52-bit Integer Multiply-Add)
  • AVX512-VBMI (Vector Byte-Manipulation)
  • AVX512-VPOPCNTDQ (Vector Population Count)
  • AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
  • AVX512-4VNNIW (4 x Neural Network Instructions)
  • AVX512-VBMI2 (Vector Byte-Manipulation 2)
  • AVX512-VNNI (Vector Neural Network Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-VAES (Vector AES Instructions)
  • AVX512-GFNI (Galois Field New Instructions)
  • AVX512-VPCLMULQDQ (Vector Carry-less Multiply)

Supporting Processors:

  • Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
  • Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNTDQ, 4FMAPS, 4VNNIW)
  • Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
  • Intel Cannonlake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
  • Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNTDQ, VBMI2, VNNI, BITALG, VAES, GFNI, VPCLMULQDQ)

Foundation (AVX512-F):

All implementations of AVX512 are required to support AVX512-F. AVX512-F extends AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides embedded masking by means of 8 opmask registers (k0–k7).

AVX512-F only supports operations on 32-bit and 64-bit words and only operates on zmm (512-bit) registers.

Conflict Detection (AVX512-CD):

AVX512-CD aids vectorization by providing instructions to detect data conflicts.

Exponential and Reciprocal (AVX512-ER):

AVX512-ER provides instructions for computing the reciprocal and exponential functions with increased accuracy. These are used to aid in the fast computation of trigonometric functions.

Prefetch (AVX512-PF):

AVX512-PF provides instructions for vector gather/scatter prefetching.

Byte and Word (AVX512-BW):

AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.

Double-word and Quad-word (AVX512-DQ):

AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.

Vector-Length (AVX512-VL):

AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm and ymm registers (as opposed to only zmm). This includes the masking as well as the increased register count of 32.

52-bit Integer Multiply-Add (AVX512-IFMA):

AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (Speculation: likely derived from the floating-point FMA hardware)

Vector Byte-Manipulation (AVX512-VBMI):

AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.

Vector Population Count (AVX512-VPOPCNTDQ)

A vectorized version of the popcnt instruction, operating on 32-bit and 64-bit elements.

4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS)

AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.

Neural Network Instructions (AVX512-4VNNIW)

Specialized instructions on 16-bit integers for neural networks. These follow the same 4-consecutive-operation instruction format as AVX512-4FMAPS.

Vector Byte-Manipulation 2 (AVX512-VBMI2)

Extends AVX512-VBMI by adding support for compress/expand at byte and word granularity.

Neural Network Instructions (AVX512-VNNI)

Specialized instructions for Neural Networks. This is the desktop/Xeon version of AVX512-4VNNIW on Knights Mill Xeon Phi.

Bit Algorithms (AVX512-BITALG)

Extends the vector population count (AVX512-VPOPCNTDQ) to 8-bit and 16-bit elements, and adds additional bit-manipulation instructions.

Vector AES Instructions (AVX512-VAES)

Extends the existing AES-NI instructions to 512-bit width.

Galois Field New Instructions (AVX512-GFNI)

Instructions for arithmetic in the Galois field GF(2^8): byte-wise multiplication and affine transformations.

Vector Carry-less Multiply (AVX512-VPCLMULQDQ)

A vectorized (512-bit) version of the pclmulqdq carry-less multiply instruction.

349 questions
11 votes, 3 answers

What are the AVX-512 Galois-field-related instructions for?

One of the AVX-512 instruction set extensions is AVX-512 + GFNI, " Galois Field New Instructions". Galois theory is about field extensions. What does that have to do with processing vectorized integer or floating-point values? The instructions…
einpoklum
11 votes, 2 answers

GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

AVX512 introduced opmask feature for its arithmetic commands. A simple example: godbolt.org. #include __m512i add(__m512i a, __m512i b) { __m512i sum; asm( "mov ebx, 0xAAAAAAAA; \n\t" …
tert
11 votes, 0 answers

What's the difference between the XOR instructions "VPXORD", "VXORPS" and "VXORPD" in Intel's AVX2

I see in AVX2 instruction set, Intel distinguishes the XOR operations of integer, double and float with different instructions. For Integer there's "VPXORD", and for double "VXORPD", for float "VXORPS" However, per my understanding, they should all…
Harper
11 votes, 1 answer

What are the differences between the compress and expand instructions in AVX-512?

I was studying the expand and compress operations from the Intel intrinsics guide. I'm confused about these two concepts: For __m128d _mm_mask_expand_pd (__m128d src, __mmask8 k, __m128d a) == vexpandpd Load contiguous active double-precision…
Amiri
11 votes, 1 answer

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a) it returns a vector where for every element in a a bit is set if it has the same value. Is there a way to do something similar in AVX2? I'm not interested in the extact bits I just…
Christoph Diegelmann
10 votes, 1 answer

Performance of AVX-512 masked memory accesses

Can masking improve the performance of AVX-512 memory operations (load/store/gather/scatter and non-shuffling load-ops)? Seeing as masked out elements don't trigger memory faults, one would assume that masking helps performance in those cases,…
zinga
10 votes, 1 answer

Does vzeroall zero registers ymm16 to ymm31?

The documentation for vzeroall appears inconsistent. The prose says: The instruction zeros contents of all XMM or YMM registers. The pseudocode below that, however, indicates that in 64-bit mode only registers ymm0 through ymm15 are affected: IF…
BeeOnRope
10 votes, 1 answer

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max-turbo indefinitely. (SIMD instructions lowering CPU frequency and Dynamically determining where a rogue AVX-512 instruction is executing) Presumably Ice Lake…
Peter Cordes
10 votes, 2 answers

Truth-table reduction to ternary logic operations, vpternlog

I have many truth-tables of many variables (7 or more) and I use a tool (eg logic friday 1) to simplify the logic formula. I could do that by hand but that is much too error prone. These formula I then translate to compiler intrinsics (eg…
HJLebbink
10 votes, 1 answer

What is the penalty of mixing EVEX and VEX encoded scheme?

It is a known issue that mixing VEX-encoded instructions and non-VEX instructions has a penalty and the programmer must be aware of it. There are some questions and answers like this. The solutions are depended on the way you program (usually you…
Amiri
9 votes, 0 answers

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array. On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX256. When I use…
He3lixxx
9 votes, 3 answers

Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element. Is there an efficient way to implement this using AVX and…
tmlen
9 votes, 2 answers

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes. I know that AVX-512 has a VPOPCNTDQ instruction which counts the…
einpoklum
9 votes, 1 answer

Do 128bit cross lane operations in AVX512 give better performance?

In designing forward looking algorithms for AVX256, AVX512 and one day AVX1024 and considering the potential implementation complexity/cost of fully generic permutes for large SIMD width I wondered if it is better to generally keep to isolated…
iam
8 votes, 1 answer

What is the granularity of "masked" stores in AVX512?

Lets say you call _mm512_mask_store_ps, from the point of view of the CPU's write buffer, is it executed as a store of size 64-bytes (with some sort of masking) or is it executed internally as multiple stores of size 4-bytes? In order to prevent…
user2059893