
I'm trying to understand the behavior of the uop cache (the DSB in Intel's docs) on my Haswell chip, based on the Intel optimization manual and Agner Fog's PDFs.

I've found a set of cases where the frontend reliably falls back to the MITE (legacy) decoder depending on slight changes in the code, which leaves me confused.

An example looks like this (GNU `as` with `-msyntax=intel`):

    mov rax, 100000000
    .align 64

1:
// DSB-cacheable nops to
// overflow LSD
    .fill 12, 1, 0x90
    .align 32
    .fill 12, 1, 0x90
    .align 32
    .fill 12, 1, 0x90
    .align 32
    .fill 12, 1, 0x90
    .align 32

// this first block should fill up way 1 of our uop-cache set
    add ecx, ecx
    nop
    nop
    nop
    add ecx, ecx
    add ecx, ecx

// way 2
    add ecx, ecx
    add ecx, ecx
    add ecx, ecx
    nop
    add ecx, ecx
    add ecx, ecx

// way 3
    or ecx, ecx
    or ecx, ecx // <---- an example of offending instruction
    or ecx, ecx
    or ecx, ecx
    or ecx, ecx
    or ecx, ecx // <---- this one as well

// next uop set
    dec rax
    jnz 1b

Obviously this is a nonsensical example. I generally include it as part of a loop containing enough other 32-byte blocks to overflow the LSD, which makes things simpler, but it doesn't seem to be required AFAICT. For similar reasons I made this block exactly 32 bytes, to rule out anything related to instructions dangling across window boundaries.

I measure DSB vs. MITE usage with the two corresponding perf counters (`IDQ.MITE_UOPS` and `IDQ.DSB_UOPS`).
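For completeness, this is roughly how I build and measure it (a sketch: it assumes Linux x86-64 with binutils and perf installed, uses an `.intel_syntax` directive instead of the `-msyntax=intel` flag, and the file name `dsb-test.s` is just my choice; the padding and test block from the question go inside the loop):

```shell
# Minimal standalone harness around the loop; the nop padding and the
# 32-byte test block from the question go between the label and dec.
cat > dsb-test.s <<'EOF'
.intel_syntax noprefix
.globl _start
_start:
    mov rax, 100000000
    .align 64
1:
    # ... nop padding and test block go here ...
    dec rax
    jnz 1b
    mov rax, 60        # exit(0)
    xor rdi, rdi
    syscall
EOF
# assemble and link only if binutils is present
if command -v as >/dev/null && command -v ld >/dev/null; then
    as -o dsb-test.o dsb-test.s && ld -o dsb-test dsb-test.o
fi
# count uops delivered by the DSB vs the legacy decoder
if [ -x dsb-test ] && command -v perf >/dev/null; then
    perf stat -e idq.dsb_uops,idq.mite_uops ./dsb-test
fi
```

If the events aren't available under those spellings, `perf list` shows the exact names for the running kernel.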

As written, this 32-byte block is cached in the DSB; however, changing either of the marked `or ecx, ecx` instructions into an `add ecx, ecx` is enough to trigger the legacy decoder.

This is surprising to me because both instructions have the same size and both decode to a single uop.
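For concreteness, my mental model of how the block should pack (a toy sketch: the limits of 3 ways per 32-byte window and 6 uops per way come from Agner Fog's description of SnB-family; the greedy packing policy is my assumption):

```python
# Toy model of DSB packing for one 32-byte code window.
# Assumed limits (Agner Fog, SnB-family): 3 ways per window, 6 uops/way.
WAYS_PER_WINDOW = 3
UOPS_PER_WAY = 6

def packs_into_dsb(uop_counts):
    """uop_counts: uops per instruction, in program order, all within
    one 32-byte window. True if they fit the assumed DSB limits."""
    ways = []
    current = 0
    for u in uop_counts:
        if current + u > UOPS_PER_WAY:
            ways.append(current)   # start a new way
            current = 0
        current += u
    ways.append(current)
    return len(ways) <= WAYS_PER_WINDOW

# the question's block: 18 single-uop instructions -> exactly 3 full ways
print(packs_into_dsb([1] * 18))   # True
# a 19th uop would need a 4th way -> MITE fallback in this model
print(packs_into_dsb([1] * 19))   # False
```

By this model the `or`/`add` swap shouldn't matter at all, which is exactly why the observed fallback is confusing.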

In fact, playing around with similar examples, the only property I've found that distinguishes instructions that trigger the fallback from those that don't is whether they could macro-fuse with a branch, if one followed.
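To make that hypothesis concrete, this is the split as I read Agner Fog's Haswell tables (a sketch: `inc`/`dec` actually fuse only with branches that don't read CF, which this toy check ignores):

```python
# First-instruction mnemonics that can macro-fuse with a following Jcc
# on Haswell, per my reading of Agner Fog's tables. The inc/dec entries
# carry a Jcc-condition restriction not modeled here.
FUSABLE = {"cmp", "test", "add", "sub", "and", "inc", "dec"}
NON_FUSABLE = {"or", "xor", "mov", "nop"}

def may_fuse(mnemonic):
    return mnemonic in FUSABLE

# matches the observation in the question:
print(may_fuse("add"))  # True  -> swapping it in triggers the fallback
print(may_fuse("or"))   # False -> block stays in the DSB
```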

I can't find a description of this (or any related) behavior anywhere. Is there something I'm missing?

carnaval
    Can you make this a [mcve] that includes a loop around this, so we can try your experiment on other systems? It is known that the legacy decoders do avoid decoding a macro-fusable instruction in the last uop slot in a decode group (optimizing for the case where a `jcc` follows by holding onto it for next group), but it's surprising if that affects how they pack into uop cache lines. Certainly possible, though. – Peter Cordes May 04 '20 at 03:54
  • I added the loop to the example. Good point about having to buffer those instructions in the decoder to handle 16byte crossing macro-fusion, thanks. I will try to do some digging in that direction. – carnaval May 04 '20 at 04:48
    IIRC, Agner Fog wrote about that in his microarch pdf. P6-family (up to Nehalem) didn't have a uop cache so macro-fusion would simply not happen if the cmp/jcc were split across decode groups. But SnB-family does delay to optimize for the uop cache. e.g. a stream of `add` instructions will only decode at 3/clock from the legacy decoders, vs. 4/clock for `or` instructions. So IIRC, that's easy to verify. But it's not clear why that would stop it from packing into the uop-cache. It's not like one line has to come from one decode group, lines are wider than decode groups. – Peter Cordes May 04 '20 at 05:01
    It definitely looks related. From what I can tell now, the fallback is triggered any time this situation (4th instruction in the decode window is rolled over for potential macro-fusion) happens 3 times in a single 32-byte code block. I agree that it sounds like a weird limitation since there must already be an intermediate fill buffer between the MITE and the uop cache to accommodate filling a single way from multiple decode windows. – carnaval May 04 '20 at 17:24

0 Answers