
In other words, is it possible to cap auto-vectorization instructions (obtained with -ffast-math -ftree-vectorize) to something like AVX while still using AVX-512 through explicit intrinsic calls?

At the moment,

  • without -mavx512f, GCC fails saying it cannot compile my program without avx-512f support. Fair enough.
  • with -mavx512f, GCC starts to use it everywhere.

I've not found any options to let GCC use explicit AVX512 intrinsics while limiting itself to something else for auto-vectorization.


Edit: Just to give a bit more context… I have skylake-avx512 Xeon Gold nodes (2 FMA units) and a domain-specific program.

When I compile with -Ofast -march=skylake-avx512 -mtune=skylake-avx512 and run on one core, I get 30% more performance than -march=haswell ….

When I increase the number of cores to all 24, -march=haswell … is twice as fast as -march=skylake-avx512 …!

The reason is the infamous AVX-512 frequency throttling…

But my domain-specific software already includes hand-vectorized parts. I do get a performance win with -fno-tree-vectorize -march=skylake-avx512 … (but not enough to beat -march=haswell … with auto-vectorization on all 24 cores), so auto-vectorization remains important.

Finally, if I use AVX2-optimized hand-vectorized kernels with -march=skylake-avx512 …, I also get crappy performance, therefore I suppose that the expensive part that is inducing the throttling is indeed the auto-vectorization, hence my original question.

  • `-mprefer-vector-width=`. By the way, instead of `-mavx512f`, I recommend you use `-march=...` so gcc has a better idea how efficient the AVX512 instructions are. – Marc Glisse Jun 08 '19 at 17:45
  • GCC's default for `tune=generic` or `skylake-avx512` is still `-mprefer-vector-width=256`, so in most code a lot of the instructions will still be VEX-coded. Except when gcc is dumb and uses `vmovdqu64 ymm0` when it could have used `vmovdqu ymm`. For FP code the AVX512 versions don't have different mnemonics. This doesn't help if you need to make *sure* it doesn't use more than AVX2 + FMA + BMI2 + whatever (`-march=haswell`) for some functions, though. – Peter Cordes Jun 08 '19 at 17:47
  • TL:DR: you can't, that's not how GCC works. Target ISA options are on a per-function level, regardless of whether you use intrinsics or not. See [How to turn on -mavx2 for only particular part of source code?](//stackoverflow.com/q/56466300) (and note the comments: make sure you do this on a large enough function, e.g. containing a hot loop not called in a hot loop) – Peter Cordes Jun 08 '19 at 18:16
  • @MarcGlisse, of course I'm actually using `-march=…` (actually targeting `skylake-avx512`): `-mavx512f` was just to keep everything short and sweet. – user11488411 Jun 09 '19 at 13:40
  • @PeterCordes technically speaking most of 256-bit vector instructions in AVX512 can be used in predicated form and so have to be EVEX-encoded. – Anton Jun 29 '19 at 06:11
  • @Anton: Notice that I wrote `vmovdqu ymm`, not `vmovdqu ymm{k1}{z}`. Obviously if you actually *do* use masking on an instruction, you need the EVEX encoding. But very often you don't. Fortunately for many instructions, there's no separate mnemonic and the assembler picks the shortest encoding out of VEX vs. EVEX for an instruction like `vpaddb ymm1, ymm2, ymm3`, but integer vector move or bitwise booleans (no element width until AVX512) do have new mnemonics. – Peter Cordes Jun 29 '19 at 06:16
  • @Anton: Any *new* AVX512-only instructions also require EVEX; unfortunately AVX512VL didn't add VEX short encodings of instructions like `vpternlogd` for use with the low 16 registers (not ymm16..31); there's lots of VEX opcode coding space so it could have been done. There are already lots of instructions available via VEX or EVEX; the decoders presumably have efficient mechanisms for handling that without costing too many more transistors. – Peter Cordes Jun 29 '19 at 06:20
  • @PeterCordes My point is compilers can and should utilize predicated instructions if profitable regardless of vector size used. Considering a code complex enough there inevitably be conditionally executed pieces that can be converted to the sequence of mask construction and predicated instructions. Of course, if a pre-AVX512 instruction has to be replaced with semantically equal AVX512 counterpart there is no point in masking but compilers are supposed to be sophisticated things and can do more than just this. :-) – Anton Jun 29 '19 at 08:46
  • There isn't any Gold model where AVX turbo scaling would explain a 1.3x perf advantage turning into a 2x deficit: not even close, so AVX-512 clock scaling is not the primary cause of your observation. – BeeOnRope Jun 29 '19 at 15:42
  • @Anton: yes AVX512VL for 256-bit masked instructions are usually better than AVX1 / AVX2 `vblendvps`. But many SIMD loops don't benefit from masking, and don't need conditional loads/stores. Until the recent update to the question, it was totally unclear that OP simply wanted to avoid 512-bit vectors, without needing to avoid EVEX prefixes. – Peter Cordes Jun 29 '19 at 17:58

1 Answer


You can use the target attribute to enable instructions on a per-function basis, allowing you to call intrinsics which would otherwise not be allowed.

I'm guessing you want to switch between implementations of certain functions based on the CPU's capabilities as determined at runtime. If so, you may want to take a look at the target_clones attribute as well.

nemequ
  • I wanted to avoid changing the code. However this confirms that I have indeed missed nothing in the doc and that capping autovectorisation level is not implemented at the command-line level. – user11488411 Jun 09 '19 at 13:44