Predecoders and decoders. Difference

Question

I am reading Agner Fog's materials and I have some doubts:

The pre-decoders and decoders can handle 16 bytes or 4 instructions per clock cycle

1. What are pre-decoders, as distinct from decoders?
2. The author talks about a cache for macro-instructions. I cannot see why it would be useful; after all, we already have an instruction cache. What is the loopback buffer?
3. What are micro-op fusion and macro-op fusion?

The answers to all those questions are right there in the microarchitecture pdf. Search within the pdf to find the description of something, if you forget what was explained earlier. — Peter Cordes, Apr 11 '16 at 23:02
2: decoding x86 instructions is hard, so it makes sense to cache the decode results. — Peter Cordes, Apr 11 '16 at 23:08

score 5 · Accepted Answer · answered Apr 11 '16 at 21:29

"The pre-decoder will find and mark the instruction boundaries, decode any prefixes and check for certain properties (e.g. branches)." (Source) (Another article)
The L1 instruction cache is the main cache for macro-instructions. A loop buffer holds a small sequence of macro-instructions (on the order of 32 bytes), so a tight loop can run from the buffer instead of re-reading the L1 cache, which saves latency and power.
"The register renaming (RAT) and retirement (RRF) stages in the pipeline are bottlenecks with a maximum throughput of 3 μops per clock cycle. In order to get more through these bottlenecks, the designers have joined some operations together that were split in two μops in previous processors. They call this μop fusion. The fused operations share a single μop in most of the pipeline and a single entry in the reorder buffer (ROB). But this single ROB entry represents two operations that have to be done by two different execution units. The fused ROB entry is dispatched to two different execution ports but is retired as a single unit." (Source)

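To make the quoted bottleneck concrete, here is a toy Python model (a sketch, not a simulator). It assumes a rename stage that accepts a fixed number of fused-domain μops per clock and nothing else; the `Instr` class, the `issue_cycles` helper and the example instructions are hypothetical names made up for illustration. The only facts it relies on are that a read-modify instruction such as `add eax, [rsi]` is a load plus an ALU operation, and that a store is a store-address plus a store-data operation.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    text: str          # assembly text, for display only
    operations: int    # execution-unit operations (unfused-domain uops)
    micro_fused: bool  # True if the operations travel as one fused uop


def uop_count(instr, fusion=True):
    """Slots the instruction occupies in the rename stage and the ROB."""
    if fusion and instr.micro_fused:
        return 1
    return instr.operations


def issue_cycles(instrs, width=3, fusion=True):
    """Clocks needed to push every uop through a rename stage `width` uops wide."""
    total = sum(uop_count(i, fusion) for i in instrs)
    return -(-total // width)  # ceiling division


# Illustrative loop body (assumed, not taken from the manual).
example_loop = [
    Instr("add eax, [rsi]", operations=2, micro_fused=True),  # load + add
    Instr("mov [rdi], eax", operations=2, micro_fused=True),  # store-address + store-data
    Instr("add rsi, 4", operations=1, micro_fused=False),
]

print(issue_cycles(example_loop, width=3, fusion=True))
print(issue_cycles(example_loop, width=3, fusion=False))
```

With fusion the three instructions need three slots and fit in one clock at width 3; without it they need five slots and take two clocks. Set `width=4` to model the wider pipeline mentioned in the comment below.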
Macro-op fusion is when the decoders recognize a pair of adjacent macro-instructions and emit them as a single micro-op. The most common example is that on newer Intel CPUs, a CMP followed by a conditional jump (Jcc) fuses into one micro-op.
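As a rough illustration of the pairing rule, the sketch below counts micro-ops for a list of mnemonics. It assumes every instruction is one micro-op, except that a compare immediately followed by a conditional jump counts as one. The `FUSIBLE_FIRST` and `CONDITIONAL_JUMPS` sets and the `count_uops` helper are hypothetical and simplified; which pairs really fuse depends on the microarchitecture, so check Agner Fog's tables for a specific CPU.

```python
# Simplified, assumed sets: real CPUs fuse different pairs per generation.
FUSIBLE_FIRST = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jz", "jnz", "jl", "jle", "jg", "jge",
                     "jb", "jbe", "ja", "jae"}


def count_uops(mnemonics):
    """Count micro-ops, treating an adjacent CMP/TEST + Jcc pair as one."""
    uops = 0
    i = 0
    while i < len(mnemonics):
        fuses = (
            mnemonics[i] in FUSIBLE_FIRST
            and i + 1 < len(mnemonics)
            and mnemonics[i + 1] in CONDITIONAL_JUMPS
        )
        i += 2 if fuses else 1  # a fused pair consumes two instructions
        uops += 1               # but produces a single micro-op
    return uops


print(count_uops(["mov", "add", "cmp", "jne"]))  # cmp + jne are adjacent
print(count_uops(["cmp", "mov", "jne"]))         # mov separates the pair
```

The first call counts 3 micro-ops for 4 instructions, because `cmp` and `jne` are adjacent. The second counts 3 for 3, because the intervening `mov` prevents fusion. An unconditional `jmp` is deliberately not in the set, since there is no condition for the compare to feed.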

3. that's for Pentium M, which doesn't have a uop cache, or a loop buffer. I think the OP is reading the Sandybridge section, because that's where I told him to start. Core2 and later have a 4-wide OOO pipeline, so they can rename/issue and retire 4 fused-domain uops per clock. More useful: sections 9.5 and 9.6 on page 124, `Micro-op fusion` and `Macro-op fusion`. — Peter Cordes, Apr 11 '16 at 23:05
