1

I'm reading this document about how to compile C/C++ code with the Intel C++ compiler and AVX-512 support on an Intel Knights Landing.

However, I'm a little bit confused about this part:

-xMIC-AVX512: use this option to generate AVX-512F, AVX-512CD, AVX-512ER and AVX-512PF.

-xCORE-AVX512: use this option to generate AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ and AVX-512VL.

For example, to generate Intel AVX-512 instructions for the Intel Xeon Phi processor x200, you should use the option -xMIC-AVX512. For example, on a Linux system

$ icc -xMIC-AVX512 application.c

This compiler option is useful when you want to build a huge binary for the Intel Xeon Phi processor x200. Instead of building it on the coprocessor where it will take more time, build it on an Intel Xeon processor-based machine

My Xeon Phi KNL is not a coprocessor (there is no need to ssh to micX or to compile with the -mmic flag). However, I don't understand whether it's better to use -xMIC-AVX512 or -xCORE-AVX512.

Secondly, about -ax instead of -x:

This compiler option is useful when you try to build a binary that can run on multiple platforms.

So -ax is used for cross-platform support, but is there any performance difference compared to -x?

Ilya Verbin
justHelloWorld

2 Answers

2

For the first question, please use -xMIC-AVX512 if you want to compile for the Intel Xeon Phi processor x200 (aka KNL processor). Note that the sentence you quoted from the paper was mistyped; it should read "This compiler option is useful when you want to build a huge binary for the Intel Xeon Phi processor x200. Instead of building it on the Intel Xeon Phi processor x200 where it will take more time, build it on an Intel Xeon processor-based machine."

For the second question, there should not be a performance difference if you run the binaries on an Intel Xeon Phi processor x200. However, a binary compiled with -ax should be bigger than one compiled with the -x option.
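To illustrate why an -ax binary tends to be bigger: as far as I understand it, -ax keeps a baseline code path next to one or more processor-specific paths and picks one at run time. The sketch below imitates that by hand and is not what ICC actually emits. The function names are hypothetical, `sum_wide` is only a plain-C stand-in for a path the compiler would build with AVX-512, and the CPU check assumes the GCC/Clang builtin `__builtin_cpu_supports` on x86.

```c
#include <stddef.h>

/* Baseline path: plain C that any CPU can run. */
double sum_baseline(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

/* Hypothetical stand-in for the processor-specific path. It keeps eight
 * partial sums (one per 64-bit lane of a 512-bit register) so the loop
 * shape resembles vectorized code, but it is ordinary C. */
double sum_wide(const double *x, size_t n) {
    double acc[8] = {0};
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (size_t k = 0; k < 8; ++k) acc[k] += x[i + k];
    double s = 0.0;
    for (size_t k = 0; k < 8; ++k) s += acc[k];
    for (; i < n; ++i) s += x[i];   /* remainder elements */
    return s;
}

/* Run-time CPU check. Assumes the GCC/Clang x86 builtin; on any other
 * compiler or architecture it reports "no AVX-512F". */
int cpu_has_avx512f(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;
#endif
}

/* Dispatcher: both paths live in the binary, one is chosen per call. */
double sum_dispatch(const double *x, size_t n) {
    return cpu_has_avx512f() ? sum_wide(x, n) : sum_baseline(x, n);
}
```

Both versions of the routine end up in the executable, which is where the extra size comes from. On the KNL itself the dispatcher should always take the specialized path, which fits the statement that -ax and -x should perform the same there.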

  • I'm sorry but I'm a little bit confused, what is an "Intel Xeon processor based machine"? As I told you, in my case I don't have a co-processor which has to be ssh-ed and where to run the code, I only ssh the "main" machine, compile and run the code on it. – justHelloWorld Feb 22 '17 at 09:28
  • @justHelloWorld, "Intel Xeon processor based machine" means a non-KNL machine with the Intel compiler. You could build your executable on another machine and then copy the binary to the KNL machine. The point is that the cores of a KNL machine are low power and low frequency (OTOH you have a lot of them), so it may be faster to build on another machine and copy the binary over. – Z boson Mar 06 '17 at 08:31
2

Another option from the link you provide is to build with -xCOMMON-AVX512. This is a tempting option because in my case it has all the instructions that I need and I can use the same option for both a KNL and a Skylake-AVX512 system. Since I don't build on a KNL system I cannot use -xHost (or -march=native with GCC).

However, -xCOMMON-AVX512 should NOT be used with KNL. The reason is that it generates the vzeroupper instruction (https://godbolt.org/z/PgFX55), which is not only unnecessary but also very slow on a KNL system.

Agner Fog's microarchitecture manual says in the KNL section:

The VZEROALL or VZEROUPPER instructions are not only superfluous here, they are actually harmful for the performance. A VZEROALL or VZEROUPPER instruction takes 36 clock cycles in 64 bit mode...

Therefore, for a KNL system you should use -xMIC-AVX512; for other systems with AVX512 you should use -xCORE-AVX512 (or -xSKYLAKE-AVX512). I use -qopt-zmm-usage=high as well.

I am not aware of a switch for ICC to disable vzeroupper once it is enabled (with GCC you can use -mno-vzeroupper).

Incidentally, by the same logic you should use -march=knl with GCC and not -mavx512f (-mavx512f -mno-vzeroupper may work if you are sure you don't need AVX512ER or AVX512PF).
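A quick way to see which AVX-512 subsets a given flag actually turned on is to look at the predefined macros (`__AVX512F__`, `__AVX512ER__` and so on), which GCC, Clang and ICC set according to the target options. The helper below is only a sketch; the function names are hypothetical. It reports ISA subsets only and says nothing about tuning, so it cannot tell you whether vzeroupper will be emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append one subset name to a space-separated list if it fits in buf. */
void append_name(char *buf, size_t cap, const char *name, int *count) {
    size_t used = strlen(buf);
    if (used + strlen(name) + 2 <= cap) {   /* separator + name + NUL */
        if (used > 0) strcat(buf, " ");
        strcat(buf, name);
        ++*count;
    }
}

/* Fill buf with the AVX-512 subsets enabled at compile time and return
 * how many were listed. The result depends on the compiler flags used. */
int avx512_enabled_subsets(char *buf, size_t cap) {
    int count = 0;
    assert(buf != NULL && cap > 0);
    buf[0] = '\0';
#ifdef __AVX512F__
    append_name(buf, cap, "F", &count);
#endif
#ifdef __AVX512CD__
    append_name(buf, cap, "CD", &count);
#endif
#ifdef __AVX512ER__
    append_name(buf, cap, "ER", &count);
#endif
#ifdef __AVX512PF__
    append_name(buf, cap, "PF", &count);
#endif
#ifdef __AVX512BW__
    append_name(buf, cap, "BW", &count);
#endif
#ifdef __AVX512DQ__
    append_name(buf, cap, "DQ", &count);
#endif
#ifdef __AVX512VL__
    append_name(buf, cap, "VL", &count);
#endif
    return count;
}

/* 1 if the build enabled the subsets found only on Xeon Phi (ER and PF). */
int build_has_knl_only_subsets(void) {
#if defined(__AVX512ER__) && defined(__AVX512PF__)
    return 1;
#else
    return 0;
#endif
}
```

Call `avx512_enabled_subsets` from `main` with a small buffer and print the result; I would expect "F CD ER PF" with -xMIC-AVX512 or -march=knl, and an empty string without any AVX-512 flag. To check for vzeroupper you still have to look at the generated assembly, for example with `objdump -d a.out | grep vzeroupper` or on godbolt as in the link above.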

Z boson
  • 1
    Yup, ISA compatibility is not the only factor. Tuning for KNL can be quite different from tuning for SKX. IDK if ICC supports `-xCOMMON-AVX512 -mtune=knl` to make a binary that can run on either CPU, but is only tuned for KNL. If so, you still might not want to run it on SKX because of false dependencies from omitting vzeroupper. – Peter Cordes Apr 26 '19 at 19:18
  • @PeterCordes, I tried `-xCOMMON-AVX512 -mtune=knl` in godbolt. It still produces `vzeroupper`. Anyway, I think I have finally learned my lesson about the importance of tuning and not just compatibility. – Z boson Apr 29 '19 at 08:27
  • @PeterCordes, interestingly, `-march=skylake-avx512` sets `-mprefer-vector-width=256` but `-mavx512f` does not. ICC does the same thing (`-xCOMMON-AVX512` uses 512-bit vectors but `-xCORE-AVX512` uses 256-bit). So by default the vector width is not defined or restricted; it's the tuning for an architecture that does this. – Z boson Apr 29 '19 at 09:54
  • `-mavx512f` doesn't include `-mavx512vl`, so even if the default tuning is `-mprefer-vector-width=256`, it's not available for EVEX encoded instructions. Although compilers could have chosen not to use AVX512 features at all with tune=generic and only AVX512F available, if AVX2 was available. – Peter Cordes Apr 29 '19 at 17:26
  • 1
    @PeterCordes, good point, I am used to not having AVX512VL and have not found it a hindrance. In any case `-mavx512vl` does not default to 256-bit vectors either: `gcc -mavx512vl -Q --help=target | grep prefer-vector-width`. – Z boson Apr 30 '19 at 07:21