Questions tagged [avx512]

AVX512 is a set of x86 SIMD instruction-set extensions, introduced by Intel, that widens vectors to 512 bits, adds new functionality (most notably masking), and doubles the number of vector registers to 32.

Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

AVX512 is broken into the sub-extensions listed below. All AVX512 implementations are required to support AVX512-F; the rest are optional.

  • AVX512-F (Foundation)
  • AVX512-CD (Conflict Detection)
  • AVX512-ER (Exponential and Reciprocal)
  • AVX512-PF (Prefetch)
  • AVX512-BW (Byte and Word instructions)
  • AVX512-DQ (Double-word and quad-word instructions)
  • AVX512-VL (Vector Length)
  • AVX512-IFMA (52-bit Integer Multiply-Add)
  • AVX512-VBMI (Vector Byte-Manipulation)
  • AVX512-VPOPCNTDQ (Vector Population Count)
  • AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
  • AVX512-4VNNIW (4 x Neural Network Instructions)
  • AVX512-VBMI2 (Vector Byte-Manipulation 2)
  • AVX512-VNNI (Vector Neural Network Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-VAES (Vector AES Instructions)
  • AVX512-GFNI (Galois Field New Instructions)
  • AVX512-VPCLMULQDQ (Vector Carry-less Multiply)

Supporting Processors:

  • Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
  • Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNTDQ, 4FMAPS, 4VNNIW)
  • Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
  • Intel Cannonlake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
  • Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNTDQ, VBMI2, VNNI, BITALG, VAES, GFNI, VPCLMULQDQ)

Foundation (AVX512-F):

All implementations of AVX512 are required to support AVX512-F. AVX512-F extends AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides embedded masking by means of 8 opmask registers (k0–k7).

AVX512-F only supports operations on 32-bit and 64-bit words and only operates on zmm (512-bit) registers.

Conflict Detection (AVX512-CD):

AVX512-CD aids vectorization by providing instructions to detect data conflicts.

Exponential and Reciprocal (AVX512-ER):

AVX512-ER provides instructions for computing the reciprocal and exponential functions with increased accuracy. These are used to aid in the fast computation of trigonometric functions.

Prefetch (AVX512-PF):

AVX512-PF provides instructions for vector gather/scatter prefetching.

Byte and Word (AVX512-BW):

AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.

Double-word and Quad-word (AVX512-DQ):

AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.

Vector-Length (AVX512-VL):

AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm and ymm registers (as opposed to only zmm). This includes the masking as well as the increased register count of 32.

52-bit Integer Multiply-Add (AVX512-IFMA):

AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (Speculation: likely derived from the floating-point FMA hardware)

Vector Byte-Manipulation (AVX512-VBMI):

AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.

Vector Population Count (AVX512-VPOPCNTDQ)

A vectorized version of the popcnt instruction, operating on 32-bit and 64-bit elements.

4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS)

AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.

Neural Network Instructions (AVX512-4VNNIW)

Specialized instructions on 16-bit integers for neural networks. These follow the same 4-consecutive-operation instruction format as AVX512-4FMAPS.

Vector Byte-Manipulation 2 (AVX512-VBMI2)

Extends AVX512-VBMI by adding support for compress/expand at byte and word granularity.

Neural Network Instructions (AVX512-VNNI)

Specialized instructions for Neural Networks. This is the desktop/Xeon version of AVX512-4VNNIW on Knights Mill Xeon Phi.

Bit Algorithms (AVX512-BITALG)

Extends the vector population count (AVX512-VPOPCNTDQ) to 8-bit and 16-bit elements, and adds additional bit-manipulation instructions.

Vector AES Instructions (AVX512-VAES)

Extends the existing AES-NI instructions to 512-bit width.

Galois Field New Instructions (AVX512-GFNI)

Instructions for arithmetic in the Galois field GF(2^8): byte-wise multiplication and affine transformations.

Vector Carry-less Multiply (AVX512-VPCLMULQDQ)

A vectorized (512-bit) version of the pclmulqdq carry-less multiply instruction.

349 questions
11 votes, 3 answers

What are the AVX-512 Galois-field-related instructions for?

One of the AVX-512 instruction set extensions is AVX-512 + GFNI, " Galois Field New Instructions". Galois theory is about field extensions. What does that have to do with processing vectorized integer or floating-point values? The instructions…
einpoklum
11 votes, 2 answers

GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

AVX512 introduced opmask feature for its arithmetic commands. A simple example: godbolt.org. #include __m512i add(__m512i a, __m512i b) { __m512i sum; asm( "mov ebx, 0xAAAAAAAA; \n\t" …
tert
11 votes, 0 answers

What's the difference between the XOR instructions "VPXORD", "VXORPS" and "VXORPD" in Intel's AVX2

I see in AVX2 instruction set, Intel distinguishes the XOR operations of integer, double and float with different instructions. For Integer there's "VPXORD", and for double "VXORPD", for float "VXORPS" However, per my understanding, they should all…
Harper
11 votes, 1 answer

What are the differences between the compress and expand instructions in AVX-512?

I was studying the expand and compress operations from the Intel intrinsics guide. I'm confused about these two concepts: For __m128d _mm_mask_expand_pd (__m128d src, __mmask8 k, __m128d a) == vexpandpd Load contiguous active double-precision…
Amiri
11 votes, 1 answer

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a) it returns a vector where for every element in a a bit is set if it has the same value. Is there a way to do something similar in AVX2? I'm not interested in the extact bits I just…
Christoph Diegelmann
10 votes, 1 answer

Performance of AVX-512 masked memory accesses

Can masking improve the performance of AVX-512 memory operations (load/store/gather/scatter and non-shuffling load-ops)? Seeing as masked out elements don't trigger memory faults, one would assume that masking helps performance in those cases,…
zinga
10 votes, 1 answer

Does vzeroall zero registers ymm16 to ymm31?

The documentation for vzeroall appears inconsistent. The prose says: The instruction zeros contents of all XMM or YMM registers. The pseudocode below that, however, indicates that in 64-bit mode only registers ymm0 through ymm15 are affected: IF…
BeeOnRope
10 votes, 1 answer

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max-turbo indefinitely. (SIMD instructions lowering CPU frequency and Dynamically determining where a rogue AVX-512 instruction is executing) Presumably Ice Lake…
Peter Cordes
10 votes, 2 answers

Truth-table reduction to ternary logic operations, vpternlog

I have many truth-tables of many variables (7 or more) and I use a tool (eg logic friday 1) to simplify the logic formula. I could do that by hand but that is much too error prone. These formula I then translate to compiler intrinsics (eg…
HJLebbink
10 votes, 1 answer

What is the penalty of mixing EVEX and VEX encoded scheme?

It is a known issue that mixing VEX-encoded instructions and non-VEX instructions has a penalty and the programmer must be aware of it. There are some questions and answers like this. The solutions are depended on the way you program (usually you…
Amiri
9 votes, 0 answers

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array. On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX256. When I use…
He3lixxx
9 votes, 3 answers

Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element. Is there an efficient way to implement this using AVX and…
tmlen
9 votes, 2 answers

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes. I know that AVX-512 has a VPOPCNTDQ instruction which counts the…
einpoklum
9 votes, 1 answer

Do 128bit cross lane operations in AVX512 give better performance?

In designing forward looking algorithms for AVX256, AVX512 and one day AVX1024 and considering the potential implementation complexity/cost of fully generic permutes for large SIMD width I wondered if it is better to generally keep to isolated…
iam
8 votes, 1 answer

What is the granularity of "masked" stores in AVX512?

Lets say you call _mm512_mask_store_ps, from the point of view of the CPU's write buffer, is it executed as a store of size 64-bytes (with some sort of masking) or is it executed internally as multiple stores of size 4-bytes? In order to prevent…
user2059893