Summary
I have noticed that small loops suffer a serious performance degradation when they cross a cache-line boundary. This appears to be specific to crossing a cache line and unrelated to whether the loop crosses any other fetch-block boundary.
The issue appears to be related to decode in some way.
Questions:
So my questions are:
- How does crossing a cacheline affect how loops are decoded?
- How can this affect loop performance?
Investigation so Far
I've been using the following test program; the Makefile and run scripts are available on github.
#define LOOP_CNT (1000 * 1000 * 1000)
/* Just #define for NOP{N}. */
#include "nops.h"
#define XOR 0
#define MOV 1
#define DSB0 2
#define DSB1 3
#define DSB2 4
#define DSB3 5
#ifdef RUNALL
# include "padding.h"
#else
# define PAYLOAD DSB0
/* Just to see if certain cacheline offsets within a page may
change things. So far nothing. */
# define CACHE_PADDING NOP0
/* Offset within cache line. Using NOP12, NOP28, NOP44, and
NOP60. */
# define LOOP_PADDING NOP60
#endif
.global _start
.text
_start:
movl $LOOP_CNT, %ecx
leaq (2048 + buf_start)(%rip), %rsp
movq %rsp, %rsi
leaq 64(%rsi), %rdi
/* Page align. */
.p2align 12
/* Padding config. */
LOOP_PADDING
/* Loop is 8 bytes. */
loop:
#if PAYLOAD == XOR
/* Payload where loop is really just the decl/jnz. */
xorl %eax, %eax
xorl %edx, %edx
#elif PAYLOAD == MOV
/* Payload testing if it has to do with memory bandwidth. */
movl (%rsi), %eax
movl (%rdi), %edx
#elif PAYLOAD == DSB0
/* Payload testing how being able to run out of the LSD vs DSB
changes things. LSD disabling behavior in first fetch block. */
popq %rax
movq %rsi, %rsp
#elif PAYLOAD == DSB1
/* Payload testing how being able to run out of the LSD vs DSB
changes things. LSD disabling behavior in second fetch block.
*/
movq %rsi, %rsp
popq %rax
#elif PAYLOAD == DSB2
/* Payload testing how being able to run out of the LSD vs DSB
changes things. LSD disabling behavior in first fetch block and
non-eliminatable mov cache line. */
popq %rax
leaq (%rsi), %rsp
#elif PAYLOAD == DSB3
/* Payload testing how being able to run out of the LSD vs DSB
changes things. LSD disabling behavior in second fetch block and
non-eliminatable first cache line. */
leaq (%rsi), %rsp
popq %rax
#else
# error NO PAYLOAD
#endif
decl %ecx
jnz loop
movl $60, %eax
xorl %edi, %edi
syscall
.section .data
.balign 4096
buf_start: .space 4096
buf_end:
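For reference, nops.h just #defines byte-exact padding macros NOP{N}. The real header is in the linked repo; a minimal sketch (the .fill bodies here are my assumption, the padding is only executed once on the way into the loop) would be:
/* Sketch of nops.h: each NOP{N} expands to exactly N bytes of padding
   (0x90 is the 1-byte nop opcode). */
#define NOP0
#define NOP12 .fill 12, 1, 0x90
#define NOP28 .fill 28, 1, 0x90
#define NOP44 .fill 44, 1, 0x90
#define NOP60 .fill 60, 1, 0x90
#define NOP61 .fill 61, 1, 0x90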
I am interested in testing NOP12, NOP28, NOP44, and NOP60: with all of them the loop crosses a fetch-block boundary, but only with NOP60 does the loop cross a cache line.
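Concretely, the loop body is 8 bytes (with the XOR payload each of the four instructions encodes in 2 bytes) and the loop label sits at page offset LOOP_PADDING, so assuming 16-byte fetch blocks and 64-byte cache lines the placements work out as:
/* Placement of the 8-byte loop within its page-aligned cache line: */
/* NOP12 -> bytes 12..19: crosses the 16B fetch-block boundary at 16 */
/* NOP28 -> bytes 28..35: crosses the fetch-block boundary at 32     */
/* NOP44 -> bytes 44..51: crosses the fetch-block boundary at 48     */
/* NOP60 -> bytes 60..67: crosses the 64B cache-line boundary        */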
I am noticing that performance gets worse when the loop crosses a cache line (NOP60) and is consistently "good" (1 cycle / iteration) otherwise.
So with PAYLOAD set to XOR I get the following results for cycles, lsd_uops, dsb_uops, and mite_uops.
Note: all numbers are the median of N=5 runs, measured on Tigerlake:
PAYLOAD | LOOP_PADDING | CYCLES | LSD_UOPS | DSB_UOPS | MITE_UOPS |
---|---|---|---|---|---|
XOR | 12 | 1.000e+09 | 3.000e+09 | 4.629e+03 | 4.468e+03 |
XOR | 28 | 1.001e+09 | 3.000e+09 | 1.651e+04 | 8.511e+03 |
XOR | 44 | 1.000e+09 | 3.000e+09 | 4.869e+03 | 4.697e+03 |
XOR | 60 | 1.102e+09 | 2.693e+09 | 4.328e+03 | 3.069e+08 |
So on a cache-line cross, performance seems to suffer by roughly 10%, and the cause seems to be that the loop begins being partially decoded by the MITE as opposed to fully fed from the LSD.
If we force the loop not to use the LSD by mismatching push/pop operations, the results get much worse for the cache-line cross.
PAYLOAD | LOOP_PADDING | CYCLES | LSD_UOPS | DSB_UOPS | MITE_UOPS |
---|---|---|---|---|---|
DSB0 | 12 | 1.000e+09 | 0.000e+00 | 3.000e+09 | 1.050e+04 |
DSB0 | 28 | 1.001e+09 | 0.000e+00 | 3.000e+09 | 2.267e+04 |
DSB0 | 44 | 1.003e+09 | 0.000e+00 | 3.000e+09 | 8.710e+04 |
DSB0 | 60 | 2.001e+09 | 1.822e+05 | 2.522e+09 | 4.773e+08 |
Here the cache-line cross takes 2x as long as the other fetch-block crosses. As well, we can see that the cache-line cross seems to continue to get some uops from the LSD. To me this hints that the loop may be being "split" along the cache line in some way.
With the current DSB0 version (LOOP_PADDING = NOP60) the loop is split across the cache line as follows:
000000000000003c <loop>:
003c: 58 pop %rax
003d: 48 89 f4 mov %rsi,%rsp
0040: ff c9 dec %ecx
0042: 75 f8 jne 3c <loop>
The second cache line doesn't have any LSD-disabling behavior, and it appears that at least some of the time the dec; jne pair is running out of the LSD.
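To make that concrete, here is the same loop annotated with where the 64-byte boundary falls (the grouping into an "LSD-disabling" half and a "clean" half is my reading of the layout):
/* DSB0 with LOOP_PADDING = NOP60: the loop straddles offset 0x40. */
/* cache line A */
/* 0x3c */ popq %rax        /* mismatched pop...                                   */
/* 0x3d */ movq %rsi, %rsp  /* ...and %rsp rewrite: the pair that disables the LSD */
/* ------ 64-byte cache-line boundary at offset 0x40 ------ */
/* cache line B */
/* 0x40 */ decl %ecx        /* nothing LSD-disabling in this half */
/* 0x42 */ jnz loop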
If we reorder the pop and mov (version DSB1) and add 1 to LOOP_PADDING (NOP61, so that the pop lands at the start of the second cache line), the loop becomes:
000000000000003d <loop>:
003d: 48 89 f4 mov %rsi,%rsp
0040: 58 pop %rax
0041: ff c9 dec %ecx
0043: 75 f8 jne 3d <loop>
We get the following results:
PAYLOAD | LOOP_PADDING | CYCLES | LSD_UOPS | DSB_UOPS | MITE_UOPS |
---|---|---|---|---|---|
DSB1 | 13 | 1.008e+09 | 0.000e+00 | 3.000e+09 | 1.837e+05 |
DSB1 | 29 | 1.008e+09 | 0.000e+00 | 3.000e+09 | 1.269e+05 |
DSB1 | 45 | 1.005e+09 | 0.000e+00 | 3.000e+09 | 1.904e+05 |
DSB1 | 61 | 2.007e+09 | 0.000e+00 | 2.699e+09 | 3.013e+08 |
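For comparison with the DSB0 layout above, here is the DSB1/NOP61 loop relative to the boundary (again my annotation): the mismatched-%rsp pair itself now straddles the cache line, which fits with lsd_uops staying at zero even in the crossing case.
/* DSB1 with LOOP_PADDING = NOP61: each half of the loop contains one of
   the mismatched-%rsp instructions. */
/* cache line A */
/* 0x3d */ movq %rsi, %rsp  /* %rsp rewrite in the first line    */
/* ------ 64-byte cache-line boundary at offset 0x40 ------ */
/* cache line B */
/* 0x40 */ popq %rax        /* mismatched pop in the second line */
/* 0x41 */ decl %ecx
/* 0x43 */ jnz loop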
Here the loop no longer ever runs out of the LSD, yet there is still a 2x performance degradation that appears to be due to the loop running out of the MITE as opposed to a faster decoder (the DSB in this case). This also suggests that the 2x performance degradation is a function of swapping between the DSB and the MITE, as opposed to only the ~10% degradation from swapping between the LSD and the MITE.
If we check the PAYLOAD == XOR case on Skylake, where there is no LSD, we see:
PAYLOAD | LOOP_PADDING | CYCLES | LSD_UOPS | DSB_UOPS | MITE_UOPS |
---|---|---|---|---|---|
XOR | 12 | 1.021e+09 | 0.000e+00 | 3.002e+09 | 1.190e+06 |
XOR | 28 | 1.019e+09 | 0.000e+00 | 3.003e+09 | 1.469e+06 |
XOR | 44 | 1.020e+09 | 0.000e+00 | 3.003e+09 | 1.526e+06 |
XOR | 60 | 2.044e+09 | 0.000e+00 | 2.755e+09 | 2.490e+08 |
This supports the theory that the 2x performance degradation is a function of DSB/MITE swapping.
From the tests I have the following observations:
- Crossing a cache line can disrupt both of the loop-optimized decode paths (LSD and DSB). This is unique to crossing a cache line and does not happen for an arbitrary fetch-block boundary.
- Loops that cross a cache line can be treated as multiple entities by the decoder.
  - This one feels especially weird!
- Crossing a cache line is worse when the loop is running out of the DSB than when it's running out of the LSD.
But I don't really understand why any of this is the case.
Can anyone help explain what is going on?