Something I've been wondering for a while, but first, one assumption to make is that all μops produced by a macro-op could have the same `rip` as the macro-op (I'm fairly sure the IQ holds a `rip` for each IFETCH block, and the decoders could easily compute the `rip` of each macro-op as the block `rip` plus the macro-op's offset, using the length information). The IDQ is 8 lines of 32 bytes for each logical core on SnB (it might be 64 bytes in recent microarchitectures, but I'm not sure), which raises the question of the format of the μops in the IDQ -- whether there is an address per IDQ line and a `jmp` instruction causes a new line to be started, similar to the μop cache. From what I've gathered from page 47 of the optimisation manual, a 32-byte-aligned region can span 3 ways, but the final way must end in a `jmp`, presumably so that it can reinitiate the fetch for the next instruction window by steering the pipeline to that address (which may jump back into the μop cache, or may have to fire up the legacy decode pipeline). If the IDQ had the same structure, the μop cache ways could be moved into it easily (so I do feel the IDQ could hold a single address at the start of each line rather than one per instruction: it doesn't need more if the instructions on a line have contiguous `rip`s and an instruction after a branch, return, etc. starts a new line). It would also allow the LSD to lock down and detect loops more efficiently, since it would only have to scan the 8 addresses at the start of the lines to check whether one matches the `jmp` target, for instance. But again, I'm not sure how the LSD is implemented precisely; sources pin the maximum detectable loop at 28 μops.
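To make the idea concrete, here is a minimal software sketch of the layout I'm hypothesising: each IDQ line stores a single base `rip`, each entry stores only its byte offset from that base, and the loop check scans just the per-line base addresses. All names here (`IDQLine`, `lsd_detect_loop`, etc.) are invented for illustration; the real hardware format is undocumented.

```python
# Hypothetical IDQ line: one base rip per line, byte offsets per entry.
class IDQLine:
    def __init__(self, base_rip):
        self.base_rip = base_rip      # rip of the first macro-op on the line
        self.offsets = []             # byte offset of each macro-op from the base

    def add_uop(self, offset):
        self.offsets.append(offset)

    def rip_of(self, index):
        # Reconstruct an entry's rip from the line base plus its offset.
        return self.base_rip + self.offsets[index]

def lsd_detect_loop(lines, jmp_target):
    # The cheap loop check described above: at most 8 comparisons on the
    # hypothesised 8-line IDQ, one against each line's base address.
    return any(line.base_rip == jmp_target for line in lines)

line = IDQLine(0x401000)
for off in (0, 3, 7, 12):             # macro-ops at +0, +3, +7, +12 bytes
    line.add_uop(off)
assert line.rip_of(2) == 0x401007
assert lsd_detect_loop([line], 0x401000)       # backwards jmp to line start
assert not lsd_detect_loop([line], 0x401002)   # jmp into the middle: no match
```

Note this sketch already assumes the offsets are byte offsets of macro-ops, which is exactly where the scheme runs into trouble later on.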
There is also the complication of the Stack Engine and how it places its synchronisation operations. Agner Fog's section on the Stack Engine in microarchitecture.pdf shows that the synchronisation μop is inserted before the `mov` or `add` that requires `rsp` synchronisation, so it would have to take the `rip` of that instruction in case there is a `ret` before the original `mov` or `add` (so that `rsp` can be compared against the RSB prediction on whatever port handles `ret` -- the BEU?(*)). I would also suggest that the Stack Engine works alongside the decoders, inserting at the same time they do, so that later instructions don't have to be shifted along to make room. There would also have to be a bit on the synchronisation op telling the allocator to discount its bytes when calculating the `rip`s relative to the address at the start of the line when issuing them to the ROB. Alternatively, it could start a new line for the instruction after the synchronisation op, but that seems expensive.
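A rough model of that behaviour, under my assumptions above: the Stack Engine tracks an `rsp` delta at decode, inserts a zero-length synchronisation op (flagged as synthetic) immediately before the instruction that needs the true `rsp`, and the allocator skips synthetic ops when accumulating byte offsets, so the sync op naturally inherits the `rip` of the triggering instruction. Every name here is hypothetical.

```python
def decode_with_stack_engine(macro_ops):
    """macro_ops: list of (mnemonic, length_in_bytes, needs_true_rsp).
    Returns μops as (mnemonic, length_in_bytes, synthetic) tuples."""
    out = []
    rsp_delta = 0
    for mnem, length, needs_rsp in macro_ops:
        if needs_rsp and rsp_delta != 0:
            # Inserted *before* the triggering instruction, as Agner Fog
            # describes; zero length and a 'synthetic' flag so the
            # allocator discounts it when computing rips.
            out.append(("sync_rsp", 0, True))
            rsp_delta = 0
        out.append((mnem, length, False))
        if mnem == "push":
            rsp_delta -= 8
        elif mnem == "pop":
            rsp_delta += 8
    return out

def rips_from_line(base_rip, uops):
    # Allocator view: accumulate only non-synthetic byte lengths.
    rips, offset = [], 0
    for mnem, length, synthetic in uops:
        rips.append(base_rip + offset)
        if not synthetic:
            offset += length
    return rips

uops = decode_with_stack_engine([("push", 1, False), ("mov", 3, True)])
assert uops[1][0] == "sync_rsp"      # inserted just before the mov
# The sync op shares the mov's rip, since its own bytes are discounted:
assert rips_from_line(0x1000, uops) == [0x1000, 0x1001, 0x1001]
```

The `rips_from_line` half is the part that quietly assumes μop byte offsets map back to macro-op `rip`s, which is the very assumption the next paragraph breaks.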
Something that stops this logic dead in its tracks is simply that the `rip` of an instruction cannot be worked out from the byte offset of the μop on the line relative to the address at the start of the line, because the length of a macro-op is not the same as the length of its μops. This could be solved by having each IDQ line correspond to one macro-op (with any synchronisation ops appended to the end of the line, i.e. `mov`+syncop on one line and the `ret` I mentioned earlier on the line below it, with a `rip` at the start of each line), which seems wasteful, I suppose. The only alternative I can think of would be tagging addresses inline for each macro-op, which seems messy.
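For completeness, here is what the "one macro-op per line" fallback looks like in the same illustrative model: each line carries its own `rip`, the μops that macro-op cracks into, and any synchronisation op appended at the end, so no offset arithmetic is needed at all. Again, all names are invented.

```python
def build_lines(decoded):
    """decoded: list of (rip, uops) pairs, one per macro-op. A macro-op may
    crack into several μops, so μop position no longer encodes a byte
    offset -- hence one full line (and one rip) per macro-op."""
    return [{"rip": rip, "uops": list(uops)} for rip, uops in decoded]

lines = build_lines([
    (0x1000, ["load", "add"]),                           # e.g. add rax, [mem] cracks into 2 μops
    (0x1004, ["store_addr", "store_data", "sync_rsp"]),  # sync op appended to its line
])
assert lines[1]["rip"] == 0x1004
assert len(lines) == 2   # one whole line per macro-op: correct rips, but
                         # wasteful capacity-wise on an 8-line IDQ
```

The waste is visible immediately: two macro-ops consume two of the eight hypothesised lines, regardless of how few μops each holds.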
Does anyone have anything to add or correct about how this might be implemented?
(*) This does link to the question of how branch mispredictions are handled. For instance, when a predicted-taken branch instruction is allocated to the RS, one of the parameters could be the `rip` of the ROB entry of the instruction after it, so that the BEU can steer the pipeline there on a misprediction. When a `ret` is allocated, one of the parameters would have to be the ROB entry of the μop after it, and another parameter the `rsp`, so the predicted return address can be compared against it.
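The `ret` check I'm imagining would amount to something like the following: compare the RSB-predicted target against the actual return address sitting at `[rsp]`, and resteer on a mismatch. This is purely illustrative; I'm not claiming this is how the BEU actually does it.

```python
def check_ret(rsb_predicted_rip, memory, rsp):
    """Hypothetical ret verification: 'memory' stands in for the stack,
    'rsp' for the synchronised stack pointer passed in at allocation."""
    actual_target = memory[rsp]           # return address on the stack
    if actual_target == rsb_predicted_rip:
        return ("correct", actual_target)
    # Misprediction: the pipeline would be resteered to the true target.
    return ("mispredict", actual_target)

mem = {0x7fff0000: 0x401234}
assert check_ret(0x401234, mem, 0x7fff0000) == ("correct", 0x401234)
assert check_ret(0x400000, mem, 0x7fff0000) == ("mispredict", 0x401234)
```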