Questions tagged [avx512]

AVX512 is Intel's next generation of SIMD instructions that widens vectors to 512-bit, and adds new functionality (masking) and more vector registers.

AVX512 is a set of instruction set extensions for x86 that features 512-bit SIMD vectors.

Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

Other resources:

Overview: Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions
Slides from a talk by Kirill Yukhin, introducing the new features of AVX-512 like masking and embedded-rounding. (With Intel-syntax asm examples.) Includes some use-case examples like conflict-detection for histograms using gather/scatter.
x86 tag wiki for x86 performance info,
especially https://uops.info/ and https://agner.org/optimize/
sse tag wiki for guides to x86 SIMD in general.

AVX512 is split into sub-extensions, including the following. All AVX512 implementations are required to support AVX512-F; the rest are optional.

AVX512-F (Foundation)
AVX512-CD (Conflict Detection)
AVX512-ER (Exponential and Reciprocal)
AVX512-PF (Prefetch)
AVX512-BW (Byte and Word instructions)
AVX512-DQ (Double-word and quad-word instructions)
AVX512-VL (Vector Length)
AVX512-IFMA (52-bit Integer Multiply-Add)
AVX512-VBMI (Vector Byte-Manipulation)
AVX512-VPOPCNT (Vector Population Count)
AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
AVX512-4VNNIW (4 x Neural Network Instructions)
AVX512-VBMI2 (Vector Byte-Manipulation 2)
AVX512-VNNI (Vector Neural Network Instructions)
AVX512-BITALG (Bit Algorithms)
AVX512-VAES (Vector AES Instructions)
AVX512-GFNI (Galois Field New Instructions)
AVX512-VPCLMULQDQ (Vector Carry-less Multiply)

Supporting Processors:

Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNT, 4FMAPS, 4VNNIW)
Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
Intel Cannon Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNT, VBMI2, VNNI, BITALG, VAES, GFNI, VPCLMULQDQ)
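Because only AVX512-F is mandatory, code usually has to check for each optional sub-extension before using it. A minimal sketch of a runtime check, assuming GCC or Clang on x86 (it relies on the compiler builtin __builtin_cpu_supports; the helper names are hypothetical, and on other compilers or architectures the sketch simply reports "unsupported"):

```c
#include <stdbool.h>

/* Hypothetical helper: true if the CPU reports AVX512-F. */
bool cpu_has_avx512f(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;   /* assumption: no GCC-style builtin, treat as unsupported */
#endif
}

/* Hypothetical helper: true if the CPU reports the five sub-extensions
   listed for Skylake Xeon (F, CD, BW, DQ, VL). */
bool cpu_has_avx512_skx_set(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f")
        && __builtin_cpu_supports("avx512cd")
        && __builtin_cpu_supports("avx512bw")
        && __builtin_cpu_supports("avx512dq")
        && __builtin_cpu_supports("avx512vl");
#else
    return false;
#endif
}
```

A dispatcher would typically call such a check once at startup and pick an AVX512 or fallback code path accordingly.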

Foundation (AVX512-F):

All implementations of AVX512 are required to support AVX512-F. AVX512-F expands AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides embedded masking by means of 8 opmask registers.

AVX512-F only supports operations on 32-bit and 64-bit elements and only operates on zmm (512-bit) registers.
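To illustrate the embedded masking, here is a small sketch of a masked 32-bit add in both merge-masking and zero-masking form. The function names are hypothetical. The intrinsic path is only compiled when the compiler defines __AVX512F__ (for example with -mavx512f); otherwise a scalar loop with the same semantics is used, so the sketch builds anywhere.

```c
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Merge-masking: lanes whose mask bit is 0 keep the value from src. */
void masked_add_merge(int32_t dst[16], const int32_t src[16],
                      const int32_t a[16], const int32_t b[16],
                      uint16_t mask)
{
#if defined(__AVX512F__)
    __m512i vsrc = _mm512_loadu_si512(src);
    __m512i va   = _mm512_loadu_si512(a);
    __m512i vb   = _mm512_loadu_si512(b);
    __m512i r    = _mm512_mask_add_epi32(vsrc, (__mmask16)mask, va, vb);
    _mm512_storeu_si512(dst, r);
#else
    for (int i = 0; i < 16; i++)
        dst[i] = ((mask >> i) & 1) ? a[i] + b[i] : src[i];
#endif
}

/* Zero-masking: lanes whose mask bit is 0 are set to 0. */
void masked_add_zero(int32_t dst[16],
                     const int32_t a[16], const int32_t b[16],
                     uint16_t mask)
{
#if defined(__AVX512F__)
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    __m512i r  = _mm512_maskz_add_epi32((__mmask16)mask, va, vb);
    _mm512_storeu_si512(dst, r);
#else
    for (int i = 0; i < 16; i++)
        dst[i] = ((mask >> i) & 1) ? a[i] + b[i] : 0;
#endif
}
```

Bit i of the mask controls lane i, so a mask of 0x0005 updates only lanes 0 and 2.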

Conflict Detection (AVX512-CD):

AVX512-CD aids vectorization by providing instructions to detect data conflicts.
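As a rough illustration of what "detecting conflicts" means, the sketch below computes, for each of 16 indices, a bitmap of the earlier lanes that hold the same index (the behaviour of vpconflictd as exposed by _mm512_conflict_epi32). The function name is hypothetical. The intrinsic path is used only if the compiler defines __AVX512F__ and __AVX512CD__; otherwise an equivalent scalar loop runs.

```c
#include <stdint.h>
#if defined(__AVX512F__) && defined(__AVX512CD__)
#include <immintrin.h>
#endif

/* out[i] gets bit j set (for j < i) when idx[j] == idx[i]. */
void conflict_epi32(uint32_t out[16], const uint32_t idx[16])
{
#if defined(__AVX512F__) && defined(__AVX512CD__)
    __m512i v = _mm512_loadu_si512(idx);
    _mm512_storeu_si512(out, _mm512_conflict_epi32(v));
#else
    for (int i = 0; i < 16; i++) {
        uint32_t bits = 0;
        for (int j = 0; j < i; j++)
            if (idx[j] == idx[i])
                bits |= (uint32_t)1 << j;
        out[i] = bits;
    }
#endif
}
```

A lane with a zero result has no earlier duplicate, so in a histogram-style loop those lanes can be scattered safely while the conflicting lanes are handled separately.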

Exponential and Reciprocal (AVX512-ER):

AVX512-ER provides approximation instructions for 2^x, reciprocal, and reciprocal square root with higher accuracy than the AVX512-F approximations. These are used to aid in the fast computation of transcendental functions.

Prefetch (AVX512-PF):

AVX512-PF provides instructions for vector gather/scatter prefetching.

Byte and Word (AVX512-BW):

AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.

Double-word and Quad-word (AVX512-DQ):

AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.

Vector-Length (AVX512-VL):

AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm and ymm registers (as opposed to only zmm). This includes the masking as well as the increased register count of 32.

52-bit Integer Multiply-Add (AVX512-IFMA):

AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (Speculation: likely derived from the floating-point FMA hardware)

Vector Byte-Manipulation (AVX512-VBMI):

AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.
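A small sketch of a full 64-byte permute, assuming the semantics of _mm512_permutexvar_epi8 (each output byte is selected by the low 6 bits of the corresponding index byte). The function name is hypothetical. The intrinsic path is compiled only when __AVX512VBMI__ is defined; otherwise a scalar loop models the same operation.

```c
#include <stdint.h>
#if defined(__AVX512F__) && defined(__AVX512VBMI__)
#include <immintrin.h>
#endif

/* dst[i] = src[idx[i] & 63] for all 64 bytes. */
void permute_bytes(uint8_t dst[64], const uint8_t idx[64],
                   const uint8_t src[64])
{
#if defined(__AVX512F__) && defined(__AVX512VBMI__)
    __m512i vidx = _mm512_loadu_si512(idx);
    __m512i vsrc = _mm512_loadu_si512(src);
    _mm512_storeu_si512(dst, _mm512_permutexvar_epi8(vidx, vsrc));
#else
    uint8_t tmp[64];
    for (int i = 0; i < 64; i++)
        tmp[i] = src[idx[i] & 63];
    for (int i = 0; i < 64; i++)   /* copy last so dst may alias src */
        dst[i] = tmp[i];
#endif
}
```

With idx[i] = 63 - i this reverses the 64 bytes of the vector.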

Vector Population Count (AVX512-VPOPCNT)

A vectorized version of the popcnt instruction for 32-bit and 64-bit words.
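For illustration, a per-element population count over sixteen 32-bit values might look like the sketch below. The function name is hypothetical. The intrinsic _mm512_popcnt_epi32 is used only when the compiler defines __AVX512VPOPCNTDQ__ (the macro GCC and Clang use for this sub-extension); otherwise a portable scalar loop gives the same result.

```c
#include <stdint.h>
#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
#include <immintrin.h>
#endif

/* out[i] = number of set bits in in[i]. */
void popcnt_epi32(uint32_t out[16], const uint32_t in[16])
{
#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
    __m512i v = _mm512_loadu_si512(in);
    _mm512_storeu_si512(out, _mm512_popcnt_epi32(v));
#else
    for (int i = 0; i < 16; i++) {
        uint32_t x = in[i];
        uint32_t n = 0;
        while (x) {         /* clear the lowest set bit each iteration */
            x &= x - 1;
            n++;
        }
        out[i] = n;
    }
#endif
}
```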

4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS)

AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.

4 x Neural Network Instructions (AVX512-4VNNIW)

Specialized instructions on 16-bit integers for Neural Networks. These follow the same "4 consecutive" op instruction format as AVX512-4FMAPS.

Vector Byte-Manipulation 2 (AVX512-VBMI2)

Extends AVX512-VBMI by adding compress/expand for byte and word (8-bit and 16-bit) elements.
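As a sketch of what a byte-granular compress does: the bytes whose mask bit is set are packed contiguously into the low end of the result, and the remaining bytes are zeroed (the zero-masking form, _mm512_maskz_compress_epi8). The function name is hypothetical. The intrinsic path is compiled only when __AVX512VBMI2__ is defined; otherwise a scalar loop models it.

```c
#include <stdint.h>
#if defined(__AVX512F__) && defined(__AVX512VBMI2__)
#include <immintrin.h>
#endif

/* Pack the selected bytes of src to the front of dst, zero the rest.
   dst and src are assumed not to overlap. */
void compress_bytes(uint8_t dst[64], const uint8_t src[64], uint64_t mask)
{
#if defined(__AVX512F__) && defined(__AVX512VBMI2__)
    __m512i v = _mm512_loadu_si512(src);
    _mm512_storeu_si512(dst, _mm512_maskz_compress_epi8((__mmask64)mask, v));
#else
    int n = 0;
    for (int i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            dst[n++] = src[i];
    for (; n < 64; n++)
        dst[n] = 0;
#endif
}
```

Expand is the inverse operation: it takes contiguous low bytes and distributes them to the positions selected by the mask.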

Neural Network Instructions (AVX512-VNNI)

Specialized instructions for Neural Networks. This is the desktop/Xeon version of AVX512-4VNNIW on Knights Mill Xeon Phi.

Bit Algorithms (AVX512-BITALG)

Extends AVX512-VPOPCNT to byte and word (8-bit and 16-bit) elements. Adds additional bit manipulation instructions.

Vector AES Instructions (AVX512-VAES)

Extends the existing AES-NI instructions to 512-bit width.

Galois Field New Instructions (AVX512-GFNI)

Instructions for arithmetic in the Galois field GF(2^8).

Vector Carry-less Multiply (AVX512-VPCLMULQDQ)

Vectorized version of the pclmulqdq instruction.

349 questions

votes

0 answers

Is there a penalty for mixing x86-64 integer instructions with AVX1/2/512 instructions?

I have seen a lot of assembly with AVX(all three flavors), and in all the cases that I have seen the most concentrated a kind of instruction is the best the code performs. But, for example, things like doing a load into a 32-bit register and then…

performance x86 avx avx2 avx512

asked Feb 19 '18 at 07:52

JLV

votes

0 answers

When can I call xsaves and xsaves64?

When is it allowed to call xsaves and xsaves64? Using Intel Software Development Emulator (8.12.0-2017-10-23), I can use xsaves64 + xrstors64 from user-space without any problems, but trying to use xsaves + xrstors produces: Illegal instruction at…

assembly avx2 context-switch avx512 fxsave

asked Nov 21 '17 at 18:18

gnzlbg


votes

1 answer

AVX-512 extensions supported on new Skylake-X (Core i9, 79xxX/XE) CPUs

AVX-512 standard consists of many extensions, and only one (AVX-512F) is mandatory. What exactly is supported by new Skylake-X (Core i9, 79xxX/XE) CPUs? Wikipedia page about AVX has details about Skylake Xeon CPUs (E5-26xx V5), but not about i9.…

avx512

asked Nov 05 '17 at 14:14

Daniel Frużyński


votes

0 answers

AVX determine number of written values

I have a 512 bit wide vector register (16 values) and a mask to store them to memory using _mm512_mask_i32scatter_epi32(). To determine how many values are written to memory I count the leading zeroes of the mask using __builtin_clz(). If the mask…

gcc avx512

asked Aug 27 '17 at 12:40

Hymir

votes

1 answer

How to detect a Xeon Phi (Knights Landing)

Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid costly transition to non-VEX state on all processors, including future Xeon processor, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023 People have also measured…

avx avx2 xeon-phi avx512 knights-landing

asked Jun 09 '17 at 20:12

Maxim Masiutin


votes

0 answers

Why this vectorization fails on AVX-512 and not on AVX2?

I have this code which I test on my AVX2 machine: bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res) { bool ret = false; // input size (-1 for the safe bilinear…

c++ parallel-processing vectorization avx2 avx512

asked May 19 '17 at 11:23

justHelloWorld


votes

1 answer

AVX512 Performance Recommendation on Bit Test and Operation

I've met a set of code with the following "kernel" as performance blocker. Since I have access to the latest Intel(R) Xeon Phi(TM) CPU 7210 (KNL), I wish to speed it up using AVX512 intrinsic. for( int y = starty; y <= endy; y++) { // hence…

c xeon-phi avx512

asked Nov 02 '16 at 02:16

veritas

votes

1 answer

Intel MIC - sum of intrinsic vector elements

I have a __m512d intrinsic vector and I need sum of his elements. Is there any easy way to do this? I am concentrated on a performance of computation, so i need to do this operation quickly. My knowledge about intrinsic is not enough to do it…

c++ simd intel-mic avx512

asked Nov 26 '15 at 18:17

JudgeDeath

votes

1 answer

Undefined reference in AVX-512

I have a C code that runs on Xeon Phi, containing many AVX-512 intrinsics. The code compiles well, until the following lines: #ifdef __MIC__ __m512i mm_idx = _mm512_set_epi32(0, 0, 0, 0, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0); __m512 mm_temp1 =…

c intrinsics icc avx512

asked Mar 31 '15 at 13:29

Zack

-1

votes

0 answers

How to transpose a matrix using AVX-512?

I'm trying to use avx-512 to do matrix transpose. But the matrix can't perfectly transposed, it seems to have memory address problems. I think the problem is related to the memory address part, such as the [j * rowA+i ] in mt code. Hope someone can…

c matrix avx512

asked Aug 16 '23 at 06:47

rcheni

-1

votes

0 answers

Usage of _mm_loadu_epi8 leads to error - ‘_mm_loadu_epi8’ was not declared in this scope

While trying to load _mm_loadu_epi8 instruction which is defined in AVX512 family of Intel Intrinsics instruction was getting error in c++ that - Usage of _mm_loadu_epi8 leads to error - ‘_mm_loadu_epi8’ was not declared in this scope. Tried to use…

c++ intrinsics avx avx512

asked Aug 14 '23 at 10:51

Srihari S

-1

votes

1 answer

Ideas to speeden up the given AVX512 code

mask_new1 = _mm512_set_epi32(0, 3, 0, 3, 0, 3, 0, 3, 0, 2, 0, 2, 0, 2, 0, 2); s1 = _mm512_permutexvar_pd(mask_new1, r); out1 = _mm512_mul_pd(s1, _mm512_mul_pd(b1, c1)); Are there any ways/ideas to perform faster the permute operation…

c++ performance optimization intrinsics avx512

asked Mar 22 '23 at 01:37

Srihari S

-1

votes

1 answer

How to prevent GCC from generating x86_64 kmov instructions?

I am using a third party x86_64 assembler that do not recognise kmov instructions (it only supports a subset of the x86_64 instruction set). The assembler is fed assembly files generated by GCC 5.1 (I can't change the version), then it parses and…

assembly gcc x86-64 avx512 gcc5

asked Nov 11 '21 at 05:35

Karim Manaouil


-1

votes

3 answers

Illegal Instruction with mm_cmpeq_epi8_mask

Im trying to run code similar to the following #include void foo() { __m128i a = _mm_set_epi8 (0,0,6,5,4,3,2,1,8,7,6,5,4,3,2,1); __m128i b = _mm_set_epi8 (0,0,0,0,0,0,0,1,8,7,6,5,4,3,2,1); __mmask16 m =…

gcc intrinsics instruction-set compiler-flags avx512

asked Jun 25 '19 at 02:10

tarashir342

-1

votes

1 answer

AVX-512 instructions library in VS2008

I have a C++ library built in Visual Studio 2017 which uses AVX-512 intrinsics. I need to link the library to VS2008 C++ code. The library is used to extract lines from an image. All the intrinsic instructions are encapsulated within the library.…

c++ visual-studio-2008 x86 linker avx512

asked Sep 27 '18 at 15:02

ibrodskiy
