I'm trying to understand the behavior of the uop cache (the DSB, in Intel's documentation) on my Haswell chip. I'm working from the Intel optimization manual and Agner Fog's PDFs.
I've found a set of cases where the frontend reliably falls back to the legacy (MITE) decoder in response to slight changes in the code, which leaves me confused.
An example of that looks like this (GNU as with -msyntax=intel):
mov rax, 100000000
.align 64
1:
// DSB-cacheable nops to
// overflow LSD
.fill 12, 1, 0x90
.align 32
.fill 12, 1, 0x90
.align 32
.fill 12, 1, 0x90
.align 32
.fill 12, 1, 0x90
.align 32
// this first block should fill up way 1 of our uop-cache set
add ecx, ecx
nop
nop
nop
add ecx, ecx
add ecx, ecx
// way 2
add ecx, ecx
add ecx, ecx
add ecx, ecx
nop
add ecx, ecx
add ecx, ecx
// way 3
or ecx, ecx
or ecx, ecx // <---- an example of offending instruction
or ecx, ecx
or ecx, ecx
or ecx, ecx
or ecx, ecx // <---- this one as well
// next uop set
dec rax
jnz 1b
Obviously it is a nonsensical example. I generally include it as part of a loop containing enough other 32-byte blocks to overflow the LSD, which makes things simpler, but it doesn't seem to be required AFAICT. For similar reasons I made this block exactly 32 bytes, to rule out anything related to instructions straddling the block boundary.
I measure the usage of DSB vs MITE using the two corresponding perf counters (IDQ.MITE_UOPS and IDQ.DSB_UOPS).
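Concretely, the build-and-measure step looks roughly like this (loop.s and loop are placeholder names, and the listing above needs a minimal _start/exit stub around it to run standalone):

```shell
# Event names as they appear in `perf list` on my machine.
events=idq.dsb_uops,idq.mite_uops
if command -v perf >/dev/null && [ -f loop.s ]; then
    # -mnaked-reg lets gas accept bare register names with -msyntax=intel
    as -msyntax=intel -mnaked-reg -o loop.o loop.s
    ld -o loop loop.o
    perf stat -e "$events" ./loop
else
    echo "needs binutils, perf and loop.s on the target machine"
fi
```

A high IDQ.MITE_UOPS count relative to IDQ.DSB_UOPS on a hot loop like this is what I read as the fallback to the legacy decoder.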
As written, this 32-byte block is cached in the DSB; however, changing one of the marked or ecx, ecx instructions into an add ecx, ecx is enough to trigger the legacy decoder.
This is surprising to me because both instructions have the same size (2 bytes) and each decodes to a single uop.
In fact, playing around with similar examples, the only common trait I've found among the instructions that do or do not change the caching behavior is whether they are macro-fusion candidates, i.e. whether they could fuse with a conditional branch if one followed. (Per Agner Fog, on Haswell ADD, SUB, CMP, TEST, AND, INC and DEC can fuse with a subsequent conditional jump, while OR and XOR cannot, which matches what I observe here.)
I can't find a description of this (or any related) behavior anywhere. Is there something I'm missing?