I went hunting in the Intel Optimization manual, and for Skylake, I could not find how long it takes to retire an instruction after it has left its execution port assuming no delays.

Can someone please provide this information or give me a reference where I can find the answer? Also, a reference to any paper/document that goes into painful detail about how the retire unit/process works, and its delays, would be much appreciated.

Thanks.

  • A comment to go with the down vote would be appreciated. – ReverseFlowControl Jan 20 '18 at 19:06
  • I'm not sure how this matters. It's just part of the latency of an instruction. – Ross Ridge Jan 20 '18 at 21:15
  • For Nehalem and its predecessor [page 77, section 2-47 of the Intel Optimization Reference Manual], the retire unit can process 4 micro-ops per cycle. I could not find those numbers for any other architectures... it's not listed. Also, the question is a "matter-of-fact" question. How is this in fact implemented? If the inquiry is of no interest to you, that's fine. – ReverseFlowControl Jan 20 '18 at 22:45
  • Understood. In that case, this question may be closed/removed then. – ReverseFlowControl Jan 20 '18 at 22:57
  • It seems you already answered your question :) A uop takes a cycle to retire, and multiple uops can retire in a given cycle (e.g. 3 for P6, 4 for Netburst, maybe 6 for SKL?). Retirement is just copying temp registers into the architectural ones (or more likely, renaming the latter); Agner Fog's [microarchitecture guide](http://www.agner.org/optimize/microarchitecture.pdf) describes it briefly on page 81. I don't think there is much to say, at least publicly. – Margaret Bloom Jan 21 '18 at 00:02
  • I did not answer my own question. The number of execution units tends to be wider than the number of retiring units, and if you consider how the Reorder Buffer works, then the delay between execution and retirement can be in the hundreds of cycles; at the architectural level it may be perceived to be 1, but at the micro-architectural level this is most often not so. What I want to know is how wide the "Retire Unit" is on Skylake. Actually, on all architectures, but my laptop is Skylake so I care about that the most. – ReverseFlowControl Jan 21 '18 at 05:23
  • @MargaretBloom: I had already found and read that. Agner's claim, which is probably accurate, is that the Pentium Pro, II and III have 3 micro-ops wide retire unit. He does say a couple of enlightening things, but not enough for my question in the post. – ReverseFlowControl Jan 21 '18 at 06:31
  • @MargaretBloom: Pre-SnB, retirement included copying results from the ROB to the permanent architectural register file. On CPUs with a physical register file, there isn't a separate permanent register file, just pointers into the PRF. I guess there are 2 register renaming tables: one to track issue, and one to track the currently-valid retirement state (so you can roll back to that on mis-speculation, other than branch misses which use an optimized recovery that only goes back to a checkpoint at the branch or something, instead of all the way to the retirement state). – Peter Cordes Jan 21 '18 at 09:02
  • There has been some mention (on Agner Fog's blog http://www.agner.org/optimize/blog/read.php?i=415#580) that Skylake widened the retirement width. (I was thinking I'd read it was 8-wide in SKL, but maybe it's even wider now). This may occasionally help for hyperthreading, by opening up new ROB space faster when the oldest instruction in both threads finishes in the same cycle. I'm not sure exactly why it's useful for it to be much higher than issue bandwidth of 4 uops per clock (total, even with hyperthreading active). – Peter Cordes Jan 21 '18 at 09:07
  • Wikichip says [Skylake can retire 4 uops per hyperthread](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Retirement). Probably with one thread active, it's still only 4 uops, because each hyperthread has its own RAT to track its architectural state. But note that Wikichip's comments about SKL's front-end are bogus. Yes, the uop-cache can deliver up to 6 uops per clock to the IDQ, but the front-end can still only issue 4 per clock. The higher uop-cache / decoder peak bandwidth is there to hide bubbles from cycles where it delivers fewer. – Peter Cordes Jan 21 '18 at 09:13
  • @PeterCordes - yes, I've been confused about retirement rates much higher than other bottlenecks like RAT. One idea is that it is there to free up resources faster. Imagine a scenario where the front-end is stalled on some limited resource, like load/store buffers or (less likely) physical registers, and the retirement "queue" is kind of full (such a queue, AFAIK, doesn't really exist: it's just the head of the ROB composed of executed instructions). The oldest 20 or whatever instructions don't free this resource, but younger ones do. A high retirement rate allows you to free the ... – BeeOnRope Jan 21 '18 at 18:32
  • ... resources faster, unblocking the front-end which is the bottleneck here. Note that this logic doesn't really work if you consider the limited resource to be "ROB entries", since while having faster retirement frees up "ROB entries" faster, it never really seems to matter (single threaded) since as long as ROB entries are freed at the RAT width, it never seems to matter if you go faster (since the front-end won't bottleneck on entries). When you consider load/store buffers though, which aren't 1:1 with ROB entries, it makes sense. – BeeOnRope Jan 21 '18 at 20:09

1 Answer

The comments to the question already cover the retirement rate, which is the throughput at which instructions can retire once they are the oldest un-retired instructions. This seems to be at least 4 instructions per cycle per thread for recent Intel (Skylake) and 8 instructions per core on AMD (Ryzen).

This rate is at least as wide as other bottlenecks such as renaming (4 on recent Intel, 5 or 6 on recent AMD), so that it is rarely a bottleneck and is hard to measure directly since most tests will bottleneck on something else before you reach the maximum retirement rate.

It seems like that might not be your question though since you wrote:

> how long it takes to retire an instruction after it has left its execution port assuming no delays

It isn't clear what you mean by "no delays", but that's a totally different question: how long that takes depends on how many instructions are in front of it waiting to retire and how long they take to retire. I suppose in the worst case, the oldest instruction is stalled (e.g., a long-latency miss to DRAM), and then retirement of any younger instructions could take 100 ns or more. Maybe that violates your "no delays" rule though? In the general case, an instruction has to wait for all earlier instructions to retire, which may be many cycles even when things are flowing smoothly.
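To make the in-order constraint concrete, here is a toy model in Python. It is only a sketch, not a description of real hardware: the function name `retire_cycles` is made up for illustration, the retirement width of 4 is taken from the estimate above, and the one-cycle minimum gap between finishing execution and retiring is a placeholder assumption (that pipeline delay is exactly the number that doesn't seem to be documented).

```python
def retire_cycles(done_cycle, retire_width=4, min_gap=1):
    """Toy in-order retirement model (illustrative only).

    done_cycle   -- cycle at which each instruction finishes executing,
                    listed in program order (oldest first)
    retire_width -- max instructions retired per cycle (assumed: 4)
    min_gap      -- assumed minimum cycles between completion and
                    retirement (placeholder, not a documented value)

    Returns the cycle in which each instruction retires.
    """
    retired = []
    cycle = 0   # cycle currently being filled with retiring instructions
    slots = 0   # retirement slots already used in that cycle
    for done in done_cycle:
        earliest = done + min_gap
        if earliest > cycle:
            # This instruction isn't ready yet: retirement stalls until it is.
            cycle = earliest
            slots = 0
        elif slots == retire_width:
            # Ready, but this cycle's retirement bandwidth is used up.
            cycle += 1
            slots = 0
        retired.append(cycle)
        slots += 1
    return retired


# Oldest instruction misses to DRAM (done at cycle 300); the younger ones
# finished at cycle 1 but still have to wait behind it.
print(retire_cycles([300, 1, 1, 1, 1, 1]))  # [301, 301, 301, 301, 302, 302]

# Nothing stalls: only the retirement width limits things.
print(retire_cycles([1, 1, 1, 1, 1]))       # [2, 2, 2, 2, 3]
```

In the first call the five younger instructions were done almost immediately, yet none retires before cycle 301, so "time from execution to retirement" is dominated by whatever is ahead in program order rather than by any fixed latency of the retire stage.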

BeeOnRope