
Summary

I have noticed that many small loops suffer a serious performance degradation when they cross a cache line. This appears to be specific to cache-line crosses, and unrelated to whether the loop crosses any other fetch-block boundary.

The issue appears to be related to decode in some way.

Questions

  1. How does crossing a cache line affect how loops are decoded?
  2. How can this affect loop performance?

Investigation so Far

I've been using the following test program; the Makefile and run scripts are available on GitHub.

#define LOOP_CNT (1000 * 1000 * 1000)

    /* Just #define for NOP{N}. */
#include "nops.h"
#define XOR 0
#define MOV 1
#define DSB0    2
#define DSB1    3
#define DSB2    4
#define DSB3    5

#ifdef RUNALL
# include "padding.h"
#else
# define PAYLOAD    DSB0

    /* Just to see if certain cacheline offsets within a page may
       change things. So far nothing.  */
# define CACHE_PADDING  NOP0

    /* Offset within cache line. Using NOP12, NOP28, NOP44, and
       NOP60.  */
# define LOOP_PADDING   NOP60
#endif
    .global _start
    .text
_start:
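    /* %rsp points 2048 bytes into buf so the pop-based payloads always
       reload the same in-bounds slot (%rsi keeps the value used to reset
       %rsp each iteration); %rdi is 64 bytes past %rsi so the MOV payload
       touches a second cache line.  */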
    movl    $LOOP_CNT, %ecx
    leaq    (2048 + buf_start)(%rip), %rsp
    movq    %rsp, %rsi
    leaq    64(%rsi), %rdi

    /* Page align.  */
    .p2align 12

    /* Padding config.  */
    LOOP_PADDING

    /* Loop is 8 bytes.  */
loop:
#if PAYLOAD == XOR
    /* Payload where loop is really just the decl/jnz.  */
    xorl    %eax, %eax
    xorl    %edx, %edx
#elif PAYLOAD == MOV
    /* Payload testing if it has to do with memory bandwidth.  */
    movl    (%rsi), %eax
    movl    (%rdi), %edx
#elif PAYLOAD == DSB0
    /* Payload testing how being able to run out of the LSD vs DSB
       changes things. LSD disabling behavior in first fetch block.  */
    popq    %rax
    movq    %rsi, %rsp
#elif PAYLOAD == DSB1
    /* Payload testing how being able to run out of the LSD vs DSB
       changes things. LSD disabling behavior in second fetch block.
     */
    movq    %rsi, %rsp
    popq    %rax
#elif PAYLOAD == DSB2
    /* Payload testing how being able to run out of the LSD vs DSB
       changes things. LSD disabling behavior in first fetch block and
       non-eliminatable mov cache line.  */
    popq    %rax
    leaq    (%rsi), %rsp
#elif PAYLOAD == DSB3
    /* Payload testing how being able to run out of the LSD vs DSB
       changes things. LSD disabling behavior in second fetch block and
       non-eliminatable first cache line.  */
    leaq    (%rsi), %rsp
    popq    %rax
#else
# error NO PAYLOAD
#endif
    decl    %ecx
    jnz loop

    movl    $60, %eax
    xorl    %edi, %edi
    syscall

    .section .data
    .balign 4096
buf_start:  .space 4096
buf_end:

I am interested in testing NOP12, NOP28, NOP44, and NOP60: with all of them the loop crosses a fetch-block boundary, but only with NOP60 does the loop cross a cache line.
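
To make the offsets concrete: with CACHE_PADDING left at NOP0, the .p2align 12 places the padding at the start of a page (and therefore of a 64-byte cache line), so the 8-byte loop body lands roughly as follows (this assumes each NOPnn macro expands to exactly nn bytes of long NOPs, which is how the nops.h macros are used here):

    NOP12:  loop at 0x0c..0x13  ->  crosses the 16-byte boundary at 0x10
    NOP28:  loop at 0x1c..0x23  ->  crosses the 32-byte boundary at 0x20
    NOP44:  loop at 0x2c..0x33  ->  crosses the 16-byte boundary at 0x30
    NOP60:  loop at 0x3c..0x43  ->  crosses the 64-byte cache-line boundary at 0x40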

I am noticing that performance gets worse when the loop crosses a cache line (NOP60) and is consistently "good" (1 cycle / iteration) otherwise.

So with PAYLOAD set to XOR I get the following results for cycles, lsd_uops, dsb_uops, and mite_uops.

Note: values are the median of N=5 runs, measured on Tiger Lake:

PAYLOAD  LOOP_PADDING  CYCLES     LSD_UOPS   DSB_UOPS   MITE_UOPS
XOR      12            1.000e+09  3.000e+09  4.629e+03  4.468e+03
XOR      28            1.001e+09  3.000e+09  1.651e+04  8.511e+03
XOR      44            1.000e+09  3.000e+09  4.869e+03  4.697e+03
XOR      60            1.102e+09  2.693e+09  4.328e+03  3.069e+08

So on the cache-line cross, performance suffers by roughly 10%, and the cause seems to be that the loop starts being partially decoded by the MITE as opposed to being served entirely from the LSD.

If we force the loop not to use the LSD by mismatching push/pop operations, the results get much worse for the cache-line cross.

PAYLOAD  LOOP_PADDING  CYCLES     LSD_UOPS   DSB_UOPS   MITE_UOPS
DSB0     12            1.000e+09  0.000e+00  3.000e+09  1.050e+04
DSB0     28            1.001e+09  0.000e+00  3.000e+09  2.267e+04
DSB0     44            1.003e+09  0.000e+00  3.000e+09  8.710e+04
DSB0     60            2.001e+09  1.822e+05  2.522e+09  4.773e+08

Here the cache-line cross takes 2x as long as the other fetch-block crosses.

As well, we can see that the cache-line cross still gets some uops from the LSD. To me this hints that maybe the loop is being "split" along the cache line in some way.

With the current DSB0 version the loop is split as follows:

000000000000003c <loop>:
    003c:   58                      pop    %rax
    003d:   48 89 f4                mov    %rsi,%rsp
    0040:   ff c9                   dec    %ecx
    0042:   75 f8                   jne    3c <loop>

The second cache line doesn't contain the LSD-disabling behavior, and it appears that at least sometimes the dec; jne is being delivered by the LSD.

If we reorder the pop and mov (version DSB1) and add 1 to each LOOP_PADDING value, then with NOP61 the loop becomes:

000000000000003d <loop>:
    003d:   48 89 f4                mov    %rsi,%rsp
    0040:   58                      pop    %rax
    0041:   ff c9                   dec    %ecx
    0043:   75 f8                   jne    3d <loop>
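
Concretely, this corresponds to changing only the padding defines, a minimal sketch assuming nops.h also provides a NOP61 macro analogous to the others:

# define PAYLOAD    DSB1
# define LOOP_PADDING   NOP61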

We get the following results:

PAYLOAD  LOOP_PADDING  CYCLES     LSD_UOPS   DSB_UOPS   MITE_UOPS
DSB1     13            1.008e+09  0.000e+00  3.000e+09  1.837e+05
DSB1     29            1.008e+09  0.000e+00  3.000e+09  1.269e+05
DSB1     45            1.005e+09  0.000e+00  3.000e+09  1.904e+05
DSB1     61            2.007e+09  0.000e+00  2.699e+09  3.013e+08

Here the loop never runs out of the LSD, yet there is still a 2x performance degradation, which appears to be due to the loop partially running out of the MITE as opposed to a faster source (the DSB in this case). This also suggests that the 2x degradation is a function of switching between the DSB and MITE, as opposed to the ~10% degradation from switching between the LSD and MITE.

If we check PAYLOAD == XOR on Skylake, where there is no LSD, we see:

PAYLOAD  LOOP_PADDING  CYCLES     LSD_UOPS   DSB_UOPS   MITE_UOPS
XOR      12            1.021e+09  0.000e+00  3.002e+09  1.190e+06
XOR      28            1.019e+09  0.000e+00  3.003e+09  1.469e+06
XOR      44            1.020e+09  0.000e+00  3.003e+09  1.526e+06
XOR      60            2.044e+09  0.000e+00  2.755e+09  2.490e+08

This supports the theory that the 2x performance degradation is a function of DSB/MITE switching.

From the tests I have the following observations:

  1. Crossing a cache line can disrupt both loop-optimized decoders (the LSD and the DSB). This is unique to crossing a cache line, not an arbitrary fetch block (a sketch of the opposite, always-aligned setup follows this list).
  2. Loops that cross a cache line can be treated as multiple entities by the decoder. This one feels especially weird!
  3. Crossing a cache line is worse when the loop is running out of the DSB than when it's running out of the LSD.
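
As a point of reference for observation 1, here is a minimal sketch (not part of the test program) of the opposite setup: aligning the loop so its 8-byte body can never straddle a 64-byte line, which is exactly what LOOP_PADDING deliberately breaks above.

    /* 16-byte alignment is already enough to keep an 8-byte loop body
       within a single 64-byte cache line.  */
    .p2align 4
loop:
    xorl    %eax, %eax
    xorl    %edx, %edx
    decl    %ecx
    jnz loop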

But I don't really understand why any of these observations are the case.

Can anyone help explain what is going on?

Noah
  • So you're putting the top of the loop near the end of a 32-byte block that was full of single-byte NOPs? You'd hope that on looping back to the start of the loop, the CPU would try again to cache the loop part of that block in the DSB, but if it didn't then yeah you'd have a DSB to MITE penalty every iteration. Take a look at `dsb2mite_switches.count` and `dsb2mite_switches.penalty_cycles`. – Peter Cordes Sep 23 '21 at 22:35
  • @PeterCordes I don't think it's 32-byte-block related. For example with `NOP28` we don't see any corresponding behavior. Yeah, the `dsb2mite_switches.penalty_cycles` is high. I don't think Tigerlake supports `dsb2mite_switches.count`, or at least I haven't been able to find the encoding for it in libpfm4. – Noah Sep 23 '21 at 22:38
  • Or you're just breaking the LSD, and the loop contains instructions that start in two different 32-byte blocks, so they can't be part of the same uop cache line. Since only 1 uop cache line can be read per clock cycle, this reduces DSB throughput. – Peter Cordes Sep 23 '21 at 22:38
  • @PeterCordes it depends on `NOP{N}`. We break the DSB at `NOP28` and `NOP60`, but the two do NOT show corresponding behavior, so I think it's fundamentally tied to cache-line crosses. Especially since we also see some degradation when the loop may run out of the LSD, but only on the cache-line cross. – Noah Sep 23 '21 at 22:39
  • @PeterCordes but IMO the strangest behavior is the first case, where I disabled the LSD: exclusively on the cache-line cross we occasionally see part of the loop decoded by the LSD, part decoded by the DSB, and part decoded by the MITE. – Noah Sep 23 '21 at 22:41
  • Macro-fusion of `dec / jnz` that split perfectly across a 32-byte boundary is possible IIRC, but not across a 64-byte boundary. Could that be part of it? That wouldn't explain the high MITE counts, though. Does this effect disappear if you use long NOPs for your `NOP{N}` definitions? (I haven't fully read the question in detail yet, mostly just looked at the disassembly and tables and skimmed the text). – Peter Cordes Sep 23 '21 at 22:42
  • @PeterCordes the `dec; jnz` is intentionally not split across a fetch boundary. The split is between the 2 payload operations and the `dec; jnz` (or in some later testing the payload itself is also split across fetch blocks, but the `dec; jnz` is always in the same fetch block). – Noah Sep 23 '21 at 22:43
  • @PeterCordes it may be that the info "you can only read from one DSB line per cycle" is slightly incorrect and it's really "you can only read from one cache line's worth of DSB lines per cycle"? The `NOP28` case essentially contradicts the first statement. – Noah Sep 23 '21 at 22:45
  • @PeterCordes I was able to reproduce the results on Skl so I imagine you should as well if you check out the github. But even if the rule for the DSB is really cache-line related, it doesn't really explain the weird behavior with the LSD (that cache-line splits can't run as effectively out of the LSD, and the case of part of the loop running out of the LSD). – Noah Sep 23 '21 at 22:46
  • Yeah, agreed, I just tried myself with a NASM test loop (without the complications of single-byte nops maybe busting the DSB), and Skylake was able to run a loop spanning a 32-byte boundary within a cache line at 1/clock, or 1.25/clock with 5 uops in it. IIRC, BeeOnRope thought at one point that Skylake might have widened the x86 code block size to 64-byte for what one way of the uop cache could cache. But maybe he was just seeing this effect. (Other evidence supports it still being 32-byte, including the JCC erratum triggering at that boundary.) – Peter Cordes Sep 23 '21 at 23:00
  • Your Tiger Lake results are interesting with the LSD counts, like we got *some* LSD counts but not all, so it was unstable in being able to stay "locked on"? Maybe an interrupt handler evicted the uop cache lines that were preventing the top of the loop from going into the DSB at some point, and once the whole loop was in the DSB, the LSD was able to lock down the uops? The 1.1e9 cycles when 2.6e9 uops came from the LSD might explain the timing, running slow for some portion. – Peter Cordes Sep 23 '21 at 23:02
  • @PeterCordes what interrupts? I don't see any context switches. But you may be onto something. I reran it a few times and while there is always an LSD/DSB/MITE split, I see the number from the LSD changing by up to two orders of magnitude. It's still slow (from 0.01% of the uops to 1%, but still). Also, they're not single-byte nops. They are defined in [nops.h](https://github.com/goldsteinn/cache_cross/blob/main/nops.h). – Noah Sep 23 '21 at 23:07
  • Timer / hardware interrupts don't cause "context switches", just round trips to the kernel. The `context-switches` kernel perf event counts when a different user-space task is scheduled onto the CPU this thread was running on, I assume. i.e. when an FPU save happens. Not for every timer / mouse / network / etc. interrupt handled on this core. If you look at cycles vs. time for a long-running microbench loop counting only user cycles, there are some missing. (And probably not just from clock-speed transitions.) – Peter Cordes Sep 23 '21 at 23:11
  • @PeterCordes ah I see. Any perf events for these you know of? But yes, I think that `1.1e9` cycles was caused by fewer uops running from the LSD and more from the MITE decoder. This is especially surprising, though, as I was under the impression that in the LSD loop alignment didn't matter since it just replays decoded uops in the RAT. – Noah Sep 23 '21 at 23:16
  • @PeterCordes but maybe it's interrupts again and loops that cross a cache line just take longer to enter the LSD? – Noah Sep 23 '21 at 23:16
  • @PeterCordes so I ran a quick test and it does seem that it takes longer for loops that cross a cache line to enter the LSD. So interrupts might explain why it ends up with worse behavior. It just spends more time in slow decode states due to interrupts. – Noah Sep 23 '21 at 23:24
  • @PeterCordes the variation in the LSD case is also quite high, so if it were interrupt-related that would help explain that. – Noah Sep 23 '21 at 23:28
  • Right, once locked down into the LSD, alignment doesn't matter. But that can't happen if you get into a condition where that doesn't happen. e.g. if getting there via decode of a bunch of 1-byte NOPs leaves you in a bad state, you're stuck in the slow lane until an interrupt-return jumps to somewhere inside the loop without going through the NOPs. Does this slow-start (or slow end?) effect go away if you use 2 or 3-byte instructions as your filler before the loop, so there's no obstacle to caching the loop's uops in the DSB to start with? (The LSD only works on uops from the DSB) – Peter Cordes Sep 24 '21 at 00:20
  • @PeterCordes err, I think there is a misunderstanding about the benchmark. 1) the nops are not part of the loop. They are just for padding pre-loop. 2) The nops are defined [here](https://github.com/goldsteinn/cache_cross/blob/main/nops.h) and use the largest nop size available. The benchmark is essentially: `padding; <payload0>; <payload1>; dec; jnz <loop>`. So in the loop there are no nops. The loop is just 4 instructions. The `LOOP_PADDING` just determines where `<loop>` is relative to a cache line but is not continuously run inside the loop. – Noah Sep 24 '21 at 01:31
  • @PeterCordes so I think the LSD stuff is figured out with the interrupts. I think the DSB stuff is that you cannot read from two uop cache lines at the same time IFF they also cross a 64-byte cache line. The loop bodies all have 1c/iteration tput, so when the loop crosses a cache line, only being able to decode half of the loop at a time becomes the bottleneck. If I expand the loops so they are 2 or more cycles/iteration it no longer seems to matter. – Noah Sep 24 '21 at 02:53
  • Agreed it seems that DSB can read multiple ways in parallel within a 64B line, but that doesn't explain any MITE. *That's* what I was trying to explain. If everything just nicely goes into the DSB, it should get locked down in the LSD very soon. Unless the slow fetch from the DSB interferes with that?? But triggering fallback to MITE? – Peter Cordes Sep 24 '21 at 03:41
  • I know the NOPs aren't part of the loop, but my hypothesis was that once the uop cache overflows (more than 3 ways for a 32-byte block), it would give up on trying to put further uops in that block into the uop cache. (There is some heuristic for when to start trying to fill the DSB again, but even a loop branch might do it, in which case nasty code before a loop couldn't mess up the loop for more than a couple iterations. But if not...). I also assumed you had 1-byte NOPs based on a name like NOP61, but I guess that was just a size not also a count. Long NOPs rule out busting the DSB. – Peter Cordes Sep 24 '21 at 03:43
  • @PeterCordes ah I see, my bad for misunderstanding. So looking at the LSD vs DSB numbers, the number of MITE uops / total cycles seems to be about the same for the two. Maybe still interrupt-related. I tried to test if it took longer for the cache-line cross to enter the DSB but didn't see any evidence of that. And it can't be related to having to fill two lines, as the 32-byte cross doesn't show the same behavior. – Noah Sep 24 '21 at 04:43
  • @PeterCordes Maybe there is some mechanism where the current IP's cache line is kept "hot" in some way across interrupts, but the other one needs to be refetched/decoded through the MITE/brought back into the DSB? – Noah Sep 24 '21 at 04:45
  • @BeeOnRope Peter mentioned that you had run some benchmarks which pointed to DSB line size == cache line size. Any chance you can point me in that direction? – Noah Sep 26 '21 at 16:32

0 Answers