4

Assuming AVX2-targeted compilation and C++ intrinsics: if I write an n-body algorithm that uses 17 registers per body-body computation, can the 17th register be mapped onto an AVX-512 register, either indirectly (by register-rename hardware) or directly (by the Visual Studio or GCC compiler), to cut the memory dependency off? For example, the Skylake architecture has 1 or 2 AVX-512 FMA units. Does this number change the total number of registers available too? (Specifically, on a Xeon Silver 4114 CPU.)

If this works, how does it work? Would the 1st hardware thread use the first half of each ZMM vector and the 2nd hardware thread use the second half, when all instructions are AVX2 or narrower?


Edit: What if there is online compilation on the target machine (with OpenCL, for example)? Can the drivers do the above register usage for me?

huseyin tugrul buyukisik

2 Answers

11

TL:DR: compile with -march=skylake-avx512 to let the compiler use EVEX prefixes to access ymm16-31 so it can (hopefully) make better asm for code that has 17 __m256 values "live" at once.

-march=skylake-avx512 includes -mavx512vl
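
For example, here's a minimal sketch (the kernel, names, and loop layout are made up, not your actual n-body code) that keeps 17 __m256 values live across a loop body:

#include <immintrin.h>
#include <cstddef>

// Hypothetical kernel, just to get 17 __m256 values live at once:
// 16 accumulators plus one broadcast constant.
// Assumes n is a multiple of 128 floats.
void scale_accumulate(const float *src, float *dst, std::size_t n)
{
    const __m256 scale = _mm256_set1_ps(1.5f);   // live value #17, read every iteration

    __m256 acc[16];                              // live values #1..#16
    for (int j = 0; j < 16; ++j)
        acc[j] = _mm256_setzero_ps();

    for (std::size_t i = 0; i < n; i += 16 * 8) {
        for (int j = 0; j < 16; ++j) {           // compilers typically fully unroll this
            __m256 v = _mm256_loadu_ps(src + i + 8 * j);
            acc[j] = _mm256_fmadd_ps(v, scale, acc[j]);
        }
    }

    for (int j = 0; j < 16; ++j)
        _mm256_storeu_ps(dst + 8 * j, acc[j]);
}

Built with only -mavx2 -mfma, the register allocator has just 16 architectural YMM registers, so it has to spill/reload one of the 17 values (or rematerialize the constant); built with -march=skylake-avx512 (or -mavx2 -mfma -mavx512vl), it at least has the option of parking the extras in ymm16-31. Exactly what it does is still up to the compiler.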


For example, the Skylake architecture has 1 or 2 AVX-512 FMA units. Does this number change the total number of registers available too?

No, the physical register file is the same size in all Skylake CPUs, regardless of how many FMA execution units are present. These things are totally orthogonal.

The number of architectural YMM registers is 16 for 64-bit AVX2, and 32 for 64-bit AVX512VL. In 32-bit code, there are always only 8 vector registers available, even with AVX512. (So 32-bit is very obsolete for most high-performance computing.)

The longer EVEX encoding is required for YMM16-31 with AVX512VL¹ + AVX2, but instructions with all operands in the low 16 registers can use the shorter VEX-prefix AVX/AVX2 form of the instruction. (There's no penalty for mixing VEX and EVEX encodings, so VEX is preferable for code-size. But if you avoid y/zmm0-y/zmm15 entirely, you don't need VZEROUPPER; legacy-SSE instructions can't touch xmm16-31, so there's no possible problem.)

Again, none of this has anything to do with the number of FMA execution units present.

Footnote 1: AVX512F only includes the ZMM versions of most instructions; you need AVX512VL for the EVEX encoding of most YMM instructions. The only CPUs with AVX512F but not AVX512VL are the Xeon Phi parts (KNL / KNM), now discontinued; all mainstream CPUs support the xmm/ymm versions of all the AVX512 instructions they support.

if I write an n-body algorithm that uses 17 registers per body-body computation, can the 17th register be indirectly (register-rename hardware) mapped

No, that's not how CPUs and machine code work. In machine code, there's only a 4-bit (without using AVX512-only encodings) or 5-bit (with AVX512 encodings) field to specify a register operand for an instruction.

If your code needs 17 vector values to be "live" at once, the compiler will have to emit instructions to spill/reload one of them when targeting x86-64 AVX2, which architecturally only has 16 YMM registers. i.e. it has 16 different names which the CPU can rename onto its larger internal register file.

If register renaming solved the whole problem, x86-64 wouldn't have bothered increasing the number of architectural registers from 8 integer / 8 xmm to 16 integer / 16 xmm.

This is why AVX512 spent 3 extra bits (1 each for dst, src1, and src2) to allow access to 32 architectural vector registers beyond what VEX prefixes can encode. (Only in 64-bit mode; 32-bit mode still only has 8. In 32-bit mode, VEX and EVEX prefixes are invalid encodings of existing instructions, and flipping those extra register-number bits would make them decode as valid encodings of those old instructions instead of as prefixes.)


Register renaming allows reuse of the same architectural register for a different value without any false dependency; i.e. it avoids WAR and WAW hazards, and is part of the "magic" that makes out-of-order execution work. It helps keep more values in flight for ILP and out-of-order execution, but it doesn't let you have more values live in architectural registers at any given point in simple program order of execution.

For example, the following loop only needs 3 architectural vector registers, and each iteration is independent (no loop-carried dependency other than the pointer increment).

.loop:
    vaddps   ymm0, ymm1, [rsi]  ; ymm0 = ymm1 + [src]
    vmulps   ymm0, ymm0, ymm2   ; ymm0 *= ymm2
    vmovaps  [rsi+rdx], ymm0    ; dst = src + (dst_start - src_start).  Stays micro-fused on Haswell+

    add      rsi, 32
    cmp      rsi, rcx   ; }while(rsi < end_src)
    jb   .loop

But with an 8-cycle latency chain from the first write of ymm0 to the last read within an iteration (Skylake addps / mulps are 4 cycles each), it would bottleneck on that, on a CPU without register renaming. The next iteration couldn't write to ymm0 until the vmovaps in this iteration had read the value.

But on an out-of-order CPU, multiple iterations are in flight at once, with each write to ymm0 renamed to write a different physical register. Ignoring the front-end bottleneck (pretend we unrolled), the CPU can keep enough iterations in flight to saturate the FMA units with 2 addps/mulps uops per clock, using about 8 physical registers. (Or more, because physical registers can't actually be freed until retirement, not as soon as the last uop has read the value.)

The limited physical register file size can be the limit on the out-of-order window size, instead of the ROB or scheduler size.

(We thought for a while that Skylake-AVX512 uses 2 PRF entries for a ZMM register, based on this result, but later more detailed experiments revealed that AVX512 mode powers up a wider PRF, or upper lanes to complement the existing PRF, so SKX in AVX512 mode still has the same number of 512-bit physical registers as 256-bit physical registers. See discussion between @BeeOnRope and @Mysticial. I think there was a better write-up of an experiment + results somewhere but I can't find it ATM.)


Related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (answer: it doesn't; the OP was confused about register-reuse. My answer explains in lots of detail, with some interesting performance experiments with multiple vector accumulators.)
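
As a rough sketch of the multiple-accumulator idea that answer discusses (the function name, unroll factor, and final reduction here are arbitrary choices; on Skylake you'd want more like 8 accumulators to fully hide the 4-cycle FMA latency at 2 FMAs per clock):

#include <immintrin.h>
#include <cstddef>

// Dot product with 4 independent accumulators, so 4 FMA dependency chains can
// be in flight instead of one serial chain bottlenecked on FMA latency.
// Assumes n is a multiple of 32 floats.
float dot(const float *a, const float *b, std::size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    for (std::size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }

    // Combine the accumulators, then horizontally sum the final vector.
    __m256 sum = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(sum), _mm256_extractf128_ps(sum, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_movehdup_ps(s));
    return _mm_cvtss_f32(s);
}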

Peter Cordes
  • A single stuck/frozen instruction doesn't stall the whole window, does it? Are there any conditions that make an instruction unable to retire for a long time? – huseyin tugrul buyukisik Feb 20 '18 at 20:37
  • @huseyintugrulbuyukisik: One "stuck" instruction like a cache-miss load does require a large out-of-order window to hide that latency. If the ROB fills with executed-but-not-retired uops, it stalls. If the RS fills with not-executed uops (all dependent on the cache-miss load), it stalls. This is a major problem in CPU design as CPU frequencies get higher relative to memory-access times. Major new ideas like the kilo-instruction processor, which checkpoints and allows out-of-order retirement, may be the way forward in the long term. https://www.csl.cornell.edu/~martinez/doc/taco04.pdf – Peter Cordes Feb 20 '18 at 20:45
  • This is the first time I've seen "out-of-order retirement". I thought they were all retiring in the order they were issued (but executed out of order). Or that's my ignorance. Thank you. Skylake is kilo-instruction-ish, I guess? Or do you mean per thread, or is it issue width (where Skylake is 4-6-8 wide)? – huseyin tugrul buyukisik Feb 20 '18 at 20:50
  • @huseyintugrulbuyukisik: No, read the paper I linked. Out-of-order retirement / KIP is a totally new idea; Skylake does *not* work that way; SKL retires in-order (like everything else) and [the ROB size is (only) 224 uops](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Execution_engine), nowhere near 1k instructions. Skylake is 4-wide. I only mentioned KIP because it's a theoretical CPU-architecture idea for letting a CPU not stall when one instruction gets stuck. – Peter Cordes Feb 20 '18 at 22:47
5

No. If you target AVX2 architectures, then the generated code has to be able to run on any AVX2-capable CPU. Many of those do not support AVX-512, so they do not have the extra registers that you'd like to use.

With that said, there's no reason why you can't compile with AVX512VL support (i.e. -mavx512vl in gcc) and write your code using AVX2 intrinsics. In this case, the compiler would be able to use the additional registers, because it is targeting AVX-512 architectures, all of which contain 32 [xyz]mm registers.
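
As a (hypothetical) concrete illustration: the source below is pure AVX2 + FMA intrinsics, and only the compile flags decide whether ymm16-31 are available to the register allocator.

#include <immintrin.h>

// Nothing AVX-512-specific in the source itself.
static inline __m256 lerp8(__m256 a, __m256 b, __m256 t)
{
    return _mm256_fmadd_ps(_mm256_sub_ps(b, a), t, a);   // a + (b - a) * t
}

// g++ -O3 -mavx2 -mfma              -> only ymm0-ymm15 are usable
// g++ -O3 -mavx2 -mfma -mavx512vl   -> ymm16-ymm31 become usable too (EVEX encodings)
// (or simply -march=skylake-avx512, which implies all of the above)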

Jason R
  • The "extra" registers have been there for quite a while now in the form of the renamed registers. You just can't access them directly. – Mysticial Feb 20 '18 at 19:39
  • AVX512F is not sufficient: you need AVX512VL to use YMM16-31 instead of the full ZMM16-31 for the EVEX encodings of most instructions. Use `-march=skylake-avx512`. – Peter Cordes Feb 20 '18 at 19:44
  • @PeterCordes This question actually brings up another question. Physically, how many registers are there? The slides for Skylake client show 168 "FP" registers which usually implies vector registers. But it doesn't say how large they are. Skylake server with AVX512 shares the same core as Skylake client, but with the external L2 and FMA. – Mysticial Feb 20 '18 at 19:50
  • @PeterCordes If the 168 registers are 512-bit wide, that would imply a lot of dead silicon on all the Skylake client chips. Or perhaps they are only 256-bit wide, and in 512-bit mode, they combine in pairs. Interestingly I have seen things that seem to support this. I have some (FP-only) code with long dependency chains that when comparing 256-bit vs. 512-bit in otherwise identical sequences (and identical clock frequency), the 512-bit one is significantly slower. And I don't think the 6-cycle port5 latency is enough to explain it. – Mysticial Feb 20 '18 at 19:52
  • @Mysticial: yeah I wondered about that. If each PRF entry is big enough to hold a ZMM register, that's a lot of wasted transistors in Skylake-client where only the low 256 bits are usable. Using up a pair of PRF entries makes a lot of sense with AVX512 being new and rarely used, and would go some way toward explaining why SKX has to shut down a vector ALU port when 512b ops are in flight. (Register-read port limits if reading a ZMM register takes two register-read ports). So you think the out-of-order window size is measurably smaller with ZMM registers? – Peter Cordes Feb 20 '18 at 19:55
  • @PeterCordes I've never been at the level to measure the OOO window. But I do see (through VTune) that there are large (30%) drops in IPC when comparing certain sequences of identical code that differ only in the 256-bit vs. 512-bit (again all FP so the usable port count is still 2 in both cases). Those sequences do have dependency chains long enough to require an OOO window of over 100 to fully saturate both the 512-bit FMA units. – Mysticial Feb 20 '18 at 19:59
  • @Mysticial That seems like a likely explanation, then. PRF size can be the bottleneck for the out-of-order window: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/. – Peter Cordes Feb 20 '18 at 20:01
  • @PeterCordes Interesting link. Thanks! – Mysticial Feb 20 '18 at 20:07
  • I edited the question; it will be online compilation on the target machine through OpenCL and the CPU's drivers. Does this make a difference versus offline compilation? – huseyin tugrul buyukisik Feb 20 '18 at 20:24