x86 decoding of multi-uop instructions

Question

Decoding becomes more efficient because an instruction that generates one fused μop can go into any of the three decoders while an instruction that generates two μops can go only to decoder D0.

I know that the decoders take x86 machine code as input (like the assembler output from mov eax, eax), and produce micro-ops as output.

How is it determined which decoder should decode a particular instruction before that instruction has been decoded? Is that the job of the pre-decoders?

Peter Cordes · Accepted Answer · 2016-04-25T14:14:15.797

Agner Fog's microarchitecture PDF explains decoding, and what happens with multi-uop instructions.

If a multi-uop instruction isn't the first insn in the block being decoded, decoding ends at that insn. In the next cycle, decoding starts at the multi-uop insn, so it will hit the complex decoder that can handle multi-uop instructions.

This is why a 3-1-3-1 repeating pattern decodes better than a 3-3-1-1 repeating pattern.
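The rule above can be sketched as a toy simulation. This is only a rough model, not a description of any specific CPU: the function name `decode_cycles`, the four-decoder layout, the 4-uop limit on the complex decoder, and the cap of 4 uops per cycle are all assumptions chosen for illustration and are not stated in this answer. It ignores instruction fetch, pre-decode, and the uop cache entirely.

```python
def decode_cycles(uops_per_insn, n_decoders=4, complex_max=4, uop_cap=4):
    """Count decode cycles for an instruction stream in a toy model.

    uops_per_insn: list of uop counts, one entry per instruction, in program order.
    Assumed model (hypothetical parameters, not from the answer):
      - decoder 0 is the complex decoder and accepts up to complex_max uops
      - every other decoder accepts only single-uop instructions
      - at most uop_cap uops are produced per cycle
    """
    i = 0
    cycles = 0
    n = len(uops_per_insn)
    while i < n:
        cycles += 1
        slot = 0    # which decoder the next instruction would use
        total = 0   # uops produced so far this cycle
        while i < n and slot < n_decoders:
            u = uops_per_insn[i]
            if slot == 0:
                if u > complex_max:
                    # Real CPUs hand these to a microcode sequencer; not modeled here.
                    raise ValueError("instruction too large for this toy model")
            else:
                if u > 1:
                    break  # multi-uop insn not first: decoding ends, retry next cycle
                if total + u > uop_cap:
                    break  # per-cycle uop cap reached
            total += u
            i += 1
            slot += 1
    return cycles


print(decode_cycles([3, 1, 3, 1] * 100))  # 200 cycles for 400 instructions
print(decode_cycles([3, 3, 1, 1] * 100))  # 300 cycles for 400 instructions
```

In this model the 3-1-3-1 stream always presents a multi-uop instruction at the start of a decode group, followed by a single-uop instruction that fills the group. The 3-3-1-1 stream wastes cycles on groups holding a lone instruction. Note that the gap here depends on the assumed per-cycle uop cap: with `uop_cap=6` both patterns take 200 cycles in this model.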

The pre-decoders only mark instruction lengths/boundaries. They don't yet know which insns will decode to multiple uops. That requires actually decoding the instructions, so there's no way to shuffle the instruction stream around to send the complex instructions to the complex decoder.

This is why instruction ordering matters when you're bottlenecked on the decoders. For CPUs with a uop cache, decode performance isn't usually critical. If it is, you have a code-size issue. It's hopefully rare for code to run often enough for its performance to matter, but infrequently enough for it not to be hot in the uop cache.
