
I am doing some profiling, and performance matters to me (even 5%). The processor is an Intel Xeon Platinum 8280 ("Cascade Lake") on Frontera. I compile my code in Release mode with the -Ofast flag. When I add -march=cascadelake, the run time gets 5-6% worse in my test case. The same happens if I use -xCORE-AVX512 instead of -march. I am using icpc 19.1.1.217. Can anyone explain why? Also, which compilation flags do you suggest for better performance?

Edit 1: I am solving a linear system, which involves operations such as dot products and matrix-vector products. It would be hard for me to provide reproducible code, but there are multiple loops in my code that the compiler can auto-vectorize. I ran Intel optimization reports on the critical loops in my code, and they mention potential speedups of at least 1.75x (over 5x for some of the loops). I allocate memory with aligned_alloc(64, size) to get 64-byte alignment, since this processor supports AVX-512. I also round the size up to a multiple of 64.
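To make the allocation scheme above concrete, here is a minimal sketch rather than my actual code. The helper names (`round_up`, `alloc_aligned_doubles`, `is_aligned_64`, `dot`) are hypothetical, and it assumes that the size being rounded is the byte count and that the data are doubles.

```cpp
#include <stdlib.h>   // aligned_alloc, free
#include <cstddef>
#include <cstdint>

// Round a byte count up to the next multiple of `alignment`.
// Assumption: `alignment` is a power of two (64 here).
inline std::size_t round_up(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Allocate room for n doubles on a 64-byte boundary.
// aligned_alloc expects the size to be a multiple of the alignment,
// which is why the byte count is rounded up first.
inline double* alloc_aligned_doubles(std::size_t n) {
    const std::size_t alignment = 64;
    const std::size_t bytes = round_up(n * sizeof(double), alignment);
    return static_cast<double*>(aligned_alloc(alignment, bytes));
}

// Check that a pointer really sits on a 64-byte boundary.
inline bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// A plain dot-product loop of the kind the compiler can auto-vectorize.
inline double dot(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Buffers obtained this way are released with `free`. Whether the compiler actually vectorizes `dot` depends on the flags used: the floating-point reduction generally needs relaxed floating-point semantics (as with -Ofast) before it will be reordered.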

I have added OpenMP support to my code and parallelized some loops, but the experiments reported here use only 1 OpenMP thread.
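For reference, the parallelized loops and the timing look roughly like the sketch below. It is an illustration under assumptions, not my real code: `dot_omp` and `seconds_per_call` are hypothetical names, and the data are held in `std::vector` only to keep the example short.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Dot product with an OpenMP reduction. If OpenMP is not enabled at
// compile time, the pragma is ignored and the loop runs serially.
inline double dot_omp(const std::vector<double>& x,
                      const std::vector<double>& y) {
    double sum = 0.0;
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Call `f` `repeats` times and return the average wall-clock
// time per call in seconds.
template <class F>
double seconds_per_call(F&& f, int repeats) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        f();
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / repeats;
}
```

A single-thread run can be obtained by setting the environment variable OMP_NUM_THREADS=1. When timing, the result of each call should be stored somewhere the compiler cannot discard (for example, a `volatile` variable), otherwise the work may be optimized away.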

I have tried -mavx2 and got the same result as with -xCORE-AVX512. I have also tried -O3 instead of -Ofast and did not get any speed-up.

Jérôme Richard
Abaris
  • What does the compiler documentation say about the flag and what optimizations are performed because of it? Have you compared the difference in assembly (the -S flag)? – Martin York Apr 20 '21 at 19:41
  • This is not uncommon, but without the code it is hard to know where the problem comes from. We can only guess. Can you provide a minimal reproducible example? Furthermore, do you have the same issue with `-mavx2`? Is the code multi-threaded? Please answer these by updating your question. – Jérôme Richard Apr 20 '21 at 19:45
  • Also `-Ofast` might not be ideal, as it may blow up binary size. You should try out other optimization levels as well. – paleonix Apr 20 '21 at 19:54
  • "Faster" might mean "more code" might mean "your code no longer fits in cache" which could end up being a step backwards. – tadman Apr 20 '21 at 20:12
  • If it is the cache, then you could try to build for size. But is your expectation that the code should be faster? Maybe you already had the best option and can only make it worse? – Devolus Apr 20 '21 at 20:56
  • Do you use an external library in this specific part of the code (e.g. BLAS)? Besides this, is the build "fast" with `-msse2 -msse3 -mssse3 -msse4 -msse4.1 -msse4.2`? What about `-mavx`? – Jérôme Richard Apr 20 '21 at 20:59
  • @MartinYork The Intel documentation says "Tells the compiler to generate code for processors that support certain features.", which is pretty vague. I was hoping to gain a speedup from the larger instruction set supported by this family of processors. My code consists of many functions, which makes it hard to investigate them. Also, I don't have experience with assembly. – Abaris Apr 20 '21 at 21:04

0 Answers