Theoretical Scalar Integer Performance KabyLake

Question

I was doing some experiments with Intel Advisor 2020, in particular with the roofline model. Something I can't quite understand is why the peak scalar integer performance (intop/cycle) differs from the theoretical value I would expect, especially since all the other metrics more or less match (vector integer performance, floating point, ...).

In particular, according to Intel Advisor the peak performance (for add) is around 2.3 integer operations per cycle, while the theoretical value I would expect is 4 intop/cycle, since there are 4 integer ALUs on 4 different ports.

Am I missing something?

I haven't used Intel Advisor, but if you run something in a loop there's loop overhead, unless one of the adds is a macro-fused `add ecx, -1` / `jnz` loop branch. In that case, yes, you can achieve exactly 4 `add` uops per clock, and it's a single-uop instruction. Note that integer multiply has only 1/clock throughput; like other scalar-integer uops with 3-cycle latency instead of 1, it can only run on port 1. https://uops.info/ / https://agner.org/optimize/ — Peter Cordes, May 28 '20 at 21:37
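The loop-overhead point in the comment above can be illustrated with a rough back-of-the-envelope model. This is only a sketch under stated assumptions: a 4-wide issue stage and 4 scalar integer ALU ports (p0, p1, p5, p6), every uop in the loop being an ALU uop, and no dependency or front-end stalls. The function name `counted_adds_per_cycle` is hypothetical, and nothing here is claimed to be how Intel Advisor computes its 2.3 figure.

```python
# Assumed machine limits for a Skylake-family core such as Kaby Lake.
ISSUE_WIDTH = 4      # assumption: fused-domain uops issued per cycle
INT_ALU_PORTS = 4    # scalar integer ALUs on p0, p1, p5, p6

def counted_adds_per_cycle(counted_adds, other_alu_uops):
    """Idealized upper bound on counted add uops per cycle for one loop body.

    counted_adds   -- single-uop adds in the body that the metric counts
    other_alu_uops -- extra ALU uops per iteration that are not counted,
                      e.g. a macro-fused dec/jnz loop branch
    """
    total_uops = counted_adds + other_alu_uops
    # Cycles per iteration: limited by issue width and by ALU port count.
    cycles = max(total_uops / ISSUE_WIDTH, total_uops / INT_ALU_PORTS)
    return counted_adds / cycles

# 3 plain adds plus a macro-fused add/jnz that is itself counted as an add.
print(counted_adds_per_cycle(4, 0))
# 3 plain adds plus an uncounted loop branch.
print(counted_adds_per_cycle(3, 1))
# Unrolling to 7 adds per iteration amortizes the uncounted branch.
print(counted_adds_per_cycle(7, 1))
```

In this model the bound only reaches 4 when the loop branch itself counts as one of the adds; otherwise unrolling moves it closer to 4 without getting there.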
@PeterCordes Thanks for the answer. I will check again when I am home, but I thought the add scalar peak that Intel Advisor was giving me would be a best-case scenario, which is why I was expecting it to be 4 (that would make more sense for a roofline plot imo). I guess it is giving me a more "realistic" scenario then. I will search more, because I couldn't find documentation telling me exactly how they compute the value. Thanks — Tommy95, May 28 '20 at 21:41
Oh, also, `lea eax, [rdi + rsi + 123]` does 2 additions per 1 uop, so with that in the mix you can get 5 additions per clock cycle (3 add / 1 slow-LEA). The LEA is "slow" (3 cycle latency, port 1 only) because it has 3 components (2 `+` in the addressing mode). If you count `lea eax, [rdi + rsi*2 + 123]` as doing `rdi + rsi + rsi + 123`, then that's yet another addition, but it's really shifting. See also https://stackoverflow.com/tags/x86/info for more x86 performance links. — Peter Cordes, May 28 '20 at 21:57
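The "5 additions per clock" arithmetic in the comment above can be written out as a small tally. This is a sketch only: the one-cycle `schedule` below is a hypothetical port assignment chosen for illustration (three `add` uops on p0, p5, p6 and the 3-component LEA on port 1), the register choices are arbitrary, and `additions_per_cycle` is a made-up helper name.

```python
# Hypothetical steady-state schedule for one cycle:
# (port, instruction, number of additions that instruction performs).
schedule = [
    ("p0", "add eax, ebx", 1),
    ("p1", "lea ecx, [rdi + rsi + 123]", 2),  # 3-component LEA: port 1 only
    ("p5", "add edx, ebx", 1),
    ("p6", "add r8d, ebx", 1),
]

def additions_per_cycle(schedule):
    """Sum the additions done in one cycle, one uop per port at most."""
    ports = [port for port, _, _ in schedule]
    # Each execution port can start at most one uop per cycle.
    if len(ports) != len(set(ports)):
        raise ValueError("two uops assigned to the same port")
    return sum(adds for _, _, adds in schedule)

print(additions_per_cycle(schedule))  # 3 adds + 2 additions inside the LEA
```

The count is still 4 uops per clock; the extra addition comes entirely from the LEA folding two `+` operations into one uop.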
So just for the record, Intel Advisor 2020 just has some numbers that are supposed to be generic limits for KBL, unrelated to any specific code you might be analyzing? And it gives 3/cycle throughput for vector integer operations like `vpsubd` that can run on any vector ALU port (p015), but 2.3/cycle for scalar integer stuff like `add` that can run on any ALU port (p0156)? That's insane. Or is it not by asm instruction? — Peter Cordes, May 30 '20 at 03:50

0 Answers