
Something I've been wondering for a while. First, one assumption: all μops produced by a macro-op could share the same rip as the macro-op (I'm fairly sure the IQ holds a rip for each IFETCH block, and the decoders could derive each macro-op's rip from that rip plus the macro-op's offset, using the length information).

The IDQ is 8 lines of 32 bytes for each logical core on SnB (it might be 64 bytes in recent microarchitectures, but I'm not sure), which raises the question of the format of the μops in the IDQ -- whether there is an address per IDQ line, with a jmp instruction causing a new line to be started, similar to the μop cache. From what I gather from page 47 of the optimisation manual, a 32-byte-aligned region can span 3 ways, but the final way must end in a jmp, presumably so that the fetch for the next instruction window can be reinitiated by steering the pipeline to that address (which may jump back into the μop cache or may have to fire up the legacy decode pipeline).

If the IDQ had the same structure, this would allow the μop cache ways to be moved to the IDQ easily. So I do feel the IDQ could have a single address at the start of each line rather than one per instruction: it doesn't need more if the instructions on a line have contiguous rips and an instruction after a branch, return, etc. starts a new line. It would also let the LSD lock down and detect loops more efficiently, since it would only have to scan the 8 addresses at the start of the lines to check whether one matches a jmp target, for instance. But again, I'm not sure exactly how the LSD is implemented; sources seem to pin the maximum detectable loop at 28 μops.
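To make concrete what I'm imagining (this is purely my speculation, not documented hardware; the class and function names are invented for illustration), here is a toy model of an IDQ organised as lines with a single start rip each, where per-macro-op rips are reconstructed from byte offsets and the loop detector only compares line-start addresses against a jmp target:

```python
# Toy model (speculative, not real hardware): each IDQ line carries one
# start rip; entries within a line store only their byte offset, and a
# loop detector scans the 8 line-start addresses for a jmp target match.

class IdqLine:
    def __init__(self, start_rip):
        self.start_rip = start_rip   # one address per line, not per uop
        self.offsets = []            # byte offset of each macro-op in the line

    def add_macro_op(self, offset):
        self.offsets.append(offset)

    def rips(self):
        # rip of each macro-op = line start address + its byte offset
        return [self.start_rip + off for off in self.offsets]

def lsd_hit(lines, jmp_target):
    # Loop detection only has to scan the line-start addresses.
    return any(line.start_rip == jmp_target for line in lines)

line = IdqLine(0x1000)
for off in (0, 3, 5, 9):          # hypothetical instruction boundaries
    line.add_macro_op(off)
print([hex(r) for r in line.rips()])
print(lsd_hit([line], 0x1000))    # a jmp back to 0x1000 would close a loop
```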

There is also the complication of the Stack Engine and how it places its synchronisation operations. Reading Agner Fog's section on the Stack Engine in microarchitecture.pdf shows that the synchronisation μop is inserted before the mov or add that requires rsp synchronisation, so it would have to take the rip of that instruction in case there is a ret before the original mov or add (so rsp can be compared against the RSB prediction on whatever port handles ret, the BEU?(*)). I would also suggest that the Stack Engine works alongside the decoders, inserting at the same time as they decode, so that instructions later in the stream don't have to be shifted along to make room. There would also have to be a bit on the synchronisation op to inform the allocator to discount its bytes when calculating rips relative to the address at the start of the line when issuing them to the ROB. Alternatively, it could start a new line for the instruction after the synchronisation op, but that seems expensive.
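Here is a toy sketch of that insertion scheme (again my speculation, not documented behaviour; `decode`, `needs_sync`, and the uop field names are all invented): the sync μop is emitted at decode time just before the instruction that forces rsp synchronisation, borrows that instruction's rip, and carries a discount flag for the allocator's rip arithmetic.

```python
# Speculative sketch: the Stack Engine runs alongside the decoders and
# emits a synchronisation uop *before* the instruction that needs the
# speculative rsp delta materialised. The sync uop reuses that
# instruction's rip and sets a flag so the allocator discounts it when
# computing rips from line-relative byte offsets.

def decode(insns, needs_sync):
    """insns: list of (rip, mnemonic); needs_sync: set of rips whose
    instruction requires rsp synchronisation first."""
    uops = []
    for rip, mnem in insns:
        if rip in needs_sync:
            # Inserted at decode time, so nothing later in the stream
            # has to be shifted along to make room.
            uops.append({'rip': rip, 'op': 'sync_rsp', 'discount': True})
        uops.append({'rip': rip, 'op': mnem, 'discount': False})
    return uops

stream = [(0x100, 'push rax'), (0x101, 'mov rax, rsp'), (0x104, 'ret')]
out = decode(stream, needs_sync={0x101})
# The sync uop precedes the mov and shares its rip:
print([(hex(u['rip']), u['op']) for u in out])
```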

Something that stops this logic dead in its tracks is simply that the rip of an instruction cannot be worked out from the byte offset of the μop within the line, because the length of a macro-op is not the same as the length of its μops. This could be solved by having each IDQ line correspond to exactly one macro-op (with any synchronisation ops appended to the end of the line, i.e. mov+sync-op on one line and the ret I mentioned earlier on the line below, with a rip at the start of each line), which seems wasteful I suppose. The only alternative I can think of would be tagging addresses inline for each macro-op, which seems messy.
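A toy demonstration of the problem (speculative illustration, invented numbers): a macro-op's x86 length bears no relation to how many μops it produces, so "line start rip + μop slot offset" miscomputes rips, whereas inline tagging gets them right at the cost of a rip per macro-op.

```python
# Speculative illustration of why offset-based rip reconstruction
# breaks once macro-ops expand into differing numbers of uops.

macro_ops = [
    # (rip, x86 length in bytes, number of uops produced)
    (0x2000, 4, 1),
    (0x2004, 2, 2),   # a 2-byte macro-op producing 2 uops
    (0x2006, 3, 1),
]

# Broken scheme: treat the uop's slot index as a byte offset from the
# line start rip.
line_start = 0x2000
slot = 0
broken = []
for rip, length, n_uops in macro_ops:
    for _ in range(n_uops):
        broken.append(line_start + slot)
        slot += 1
print([hex(r) for r in broken])   # wrong rips after the first macro-op

# Inline tagging: every uop carries its macro-op's rip explicitly.
tagged = [rip for rip, length, n_uops in macro_ops for _ in range(n_uops)]
print([hex(r) for r in tagged])   # correct, but costs a rip per macro-op
```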

Does anyone have anything to add or correct about how this might be implemented?

(*) This does link to the question of how branch mispredictions are handled. For instance, when a predicted-taken branch instruction is allocated to the RS, one of the parameters could be the rip of the ROB entry of the instruction after it, so the BEU can steer the pipeline there on a misprediction. When a ret is allocated, one of the parameters would have to be the ROB entry of the μop after it, and another parameter the rsp, so the return address can be compared against it.
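To sketch what I mean by those parameters (speculative, with invented names `rs_entry`, `resolve_branch`): the RS entry of a predicted-taken branch carries the fall-through rip so the branch unit can redirect fetch on a mispredict.

```python
# Speculative sketch: an RS entry for a predicted-taken branch stores
# the rip of the next sequential instruction; on a mispredict the
# branch unit steers fetch there. (A ret entry would additionally
# carry the stack-engine rsp for the RSB check.)

def resolve_branch(rs_entry, actually_taken):
    # Predicted taken but actually not taken: redirect to fall-through.
    if rs_entry['predicted_taken'] and not actually_taken:
        return rs_entry['fallthrough_rip']
    return None   # prediction was correct, no redirect needed

entry = {'predicted_taken': True, 'fallthrough_rip': 0x4007}
print(resolve_branch(entry, actually_taken=False))   # redirect target
print(resolve_branch(entry, actually_taken=True))    # no redirect
```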

Lewis Kelsey
  • IDQ sizes are measured in uops, not bytes. e.g. Skylake has a 64-uop IDQ. (Replicated for each thread, so in SMT mode each thread has its own 64-uop queue). Where are you getting this "8 lines of 32 bytes"? – Peter Cordes May 23 '19 at 20:05
  • *but the final way must end in a `jmp`* No, you have that backwards. Any `jmp` will end a uop cache "way" (aka line). But a line can also simply end because the start of the next x86 instruction is on the other side of a 32-byte boundary. (IIRC, an instruction that spans a 32 or 64-byte boundary goes into the uop cache line according to where its *first* byte is located.) – Peter Cordes May 23 '19 at 20:10
  • LSD size of 28 uops comes from IDQ size on SnB / IvB. Physically the IDQ is probably a circular buffer with a head and tail (not actually copying all the data to the next entry). The LSD just "locks down" the uops in the queue when a jump back to an uop in the IDQ is identified (and meets other heuristics). So SKL has an LSD limit of 64 uops, if microcode updates haven't disabled the LSD. HSW in single-thread mode (not HT) should have a 56-entry LSD. See [Is performance reduced when executing loops whose uop count is not a multiple of processor width?](//stackoverflow.com/q/39311872). – Peter Cordes May 24 '19 at 02:32

0 Answers