In other words, is it possible to cap auto-vectorization (enabled with `-ffast-math -ftree-vectorize`) to something like AVX, while still using AVX-512 through explicit intrinsic calls?
At the moment:
- without `-mavx512f`, GCC fails, saying it cannot compile my program without AVX-512F support. Fair enough.
- with `-mavx512f`, GCC starts to use it everywhere.
I've not found any option to let GCC use explicit AVX-512 intrinsics while limiting itself to something else for auto-vectorization.
Edit: just to give a bit more context… I have skylake-avx512 Xeon Gold nodes (2 FMA units) and a domain-specific program. When I compile with `-Ofast -march=skylake-avx512 -mtune=skylake-avx512` and run on one core, I get 30% more performance than with `-march=haswell …`.
When I increase the number of cores to all 24, `-march=haswell …` is twice as fast as `-march=skylake-avx512 …`! The reason is the infamous core frequency throttling…
But my domain-specific software already includes hand-vectorized parts. I do get a performance win with `-fno-tree-vectorize -march=skylake-avx512 …` (though not enough to beat `-march=haswell …` with all 24 cores and auto-vectorization), so auto-vectorization is important.
Finally, if I use AVX2-optimized hand-vectorized kernels with `-march=skylake-avx512 …`, I also get poor performance, so I suppose the expensive part inducing the throttling is indeed the auto-vectorized code — hence my original question.