
I understand the basic working of a load-store queue, which is:

  1. When a load computes its address, it checks the store queue for any prior (older) store to the same address. If there is one, it gets the data from the most recent such store; otherwise it gets the data from the write buffer or the data cache.
  2. When a store computes its address, it checks the load queue for any load violations, i.e. younger loads to the same address that have already executed (sketched below).
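
In pseudocode, this is roughly how I picture the two checks (heavily simplified: single-word accesses, no partial overlaps, no memory-ordering details, and all names are made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per in-flight store / load, held in program order (index = age).
struct StoreQueueEntry {
    bool     addr_ready = false;
    uint64_t addr       = 0;
    bool     data_ready = false;
    uint64_t data       = 0;
};
struct LoadQueueEntry {
    bool     addr_ready = false;
    uint64_t addr       = 0;
    bool     executed   = false;   // already obtained a value (from SQ, cache, ...)
};

// Result of the store-queue search done by a load.
struct SqSearchResult {
    bool     must_wait = false;    // matching older store whose data isn't ready yet
    bool     forwarded = false;
    uint64_t data      = 0;
};

// Step 1: a load whose address just became known searches older stores, youngest first.
SqSearchResult load_checks_store_queue(const std::vector<StoreQueueEntry>& sq,
                                       size_t num_older_stores, uint64_t load_addr) {
    for (size_t i = num_older_stores; i-- > 0; ) {          // youngest older store first
        const StoreQueueEntry& st = sq[i];
        if (!st.addr_ready) continue;                       // speculate past the unresolved address
        if (st.addr != load_addr) continue;
        if (st.data_ready) return {false, true, st.data};   // store-to-load forwarding
        return {true, false, 0};                            // match, but data not produced yet
    }
    return {};                                              // no match: read write buffer / D-cache
}

// Step 2: a store whose address just became known searches younger loads for violations.
bool store_checks_load_queue(const std::vector<LoadQueueEntry>& lq,
                             size_t first_younger_load, uint64_t store_addr) {
    for (size_t i = first_younger_load; i < lq.size(); ++i) {
        const LoadQueueEntry& ld = lq[i];
        if (ld.executed && ld.addr_ready && ld.addr == store_addr)
            return true;                                    // a younger load already used stale data
    }
    return false;
}
```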

My doubts are about what happens in the following cases:

  1. In the first case, suppose the load accesses the data cache (because some store addresses in the store queue are still unresolved) and misses in the L1 data cache. Before the data can be retrieved from the lower levels of the hierarchy, the store address resolves. Now the store checks the load queue for violations. The dependent load has already accessed the data cache but hasn't received the value yet because of the long-latency miss. Does the store flag a load violation, or does it do store-to-load forwarding and cancel the data coming back from the cache?

  2. When a load misses in the L1 data cache, it is placed in an MSHR so that it doesn't block the execute stage. When the miss resolves, the MSHR entry for that load has the destination register and the physical address, so the value can be written to the physical register. But how does the MSHR tell the load queue that the value is available, and at which pipeline stage does this happen? I have read that MSHRs store physical addresses while the load-store queue stores virtual addresses, so how does the MSHR communicate with the LSQ? (I sketch what I have in mind below.)
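
To make the second doubt concrete, the structures I picture look roughly like this (all field names are made up); my question is what the link from the MSHR back to the load queue looks like, and at which stage it is followed:

```cpp
#include <cstdint>
#include <vector>

// Load-queue entry: searched by *virtual* address when stores check for violations.
struct LoadQueueEntry {
    uint64_t virt_addr = 0;
    uint16_t dest_preg = 0;        // destination physical register
    bool     done      = false;    // value available to dependent instructions
};

// MSHR entry: allocated on an L1D miss, tracks the outstanding *physical* line address.
struct MshrEntry {
    uint64_t phys_line_addr = 0;
    // One "target" per load (or store / prefetch) waiting on this line:
    struct Target {
        uint16_t dest_preg;        // where to write the returning data
        uint16_t offset_in_line;   // which bytes of the line this load wants
        // ??? does it also need e.g. the load-queue index, so that the fill can tell
        //     the LQ this load is done? That is exactly what I'm unsure about.
    };
    std::vector<Target> targets;
};
```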

I haven't found any resources regarding these doubts.

Nebula
  • Intel CPUs for example replay the uops waiting for a cache-miss load result in anticipation of it being an L2 hit, then an L3 hit, and then apparently keep replaying them until they eventually succeed. (If those uops are the oldest for that port). [Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?](//stackoverflow.com/q/54084992). And see also the top part of [About the RIDL vulnerabilities and the "replaying" of loads](//stackoverflow.com/a/56188631) - but take careful note of the edit-needed caveat. – Peter Cordes Jan 24 '21 at 01:52
  • @Peter, at least in my tests on Skylake they only seem to speculatively dispatch in anticipation of an L1 or L2 hit, not L3 or beyond. That makes sense since L3 hits are not constant latency. So you usually get 3 total dispatches for a miss to L3 or DRAM *if there is a single instruction directly dependent on the load*. You could of course get more if there are more dependent instructions, and it gets especially interesting when you have a chain of dependent loads. – BeeOnRope Jan 24 '21 at 08:07
  • @BeeOnRope: Maybe I'm misremembering, but I thought we'd (you'd) seen many extra dispatches over time for a uop waiting for a cache miss from RAM. Probably that was with a pointer-chasing test so we could consistently have exactly one cache-miss load in flight at once that had its address ready. IIRC L2-hit pointer-chasing had 1 extra dispatch, and L3-hit had a couple more, and it seemed L3-miss had enough extra to be explained by starting to dispatch every 5 cycles after a certain point. Or something along those lines. – Peter Cordes Jan 24 '21 at 15:48
  • @BeeOnRope: Is there a good Q&A with an updated description of uop replay? It seems I never got around to updating some of my answers after we discovered that it's not split loads or cache misses themselves that replay from the RS, it's the uop(s) dependent on them, so pointer chasing misled us. But I had hoped there was an accurate description somewhere outside of comments. Maybe on your wiki? – Peter Cordes Jan 24 '21 at 15:51
  • @PeterCordes - yes, exactly: you can see many replays per miss (up to ~10, IIRC), but those are in cases of "nested" replays like pointer chasing or in cases where many uops are dependent on the load. I don't recall any repeated dispatch over time for pure load misses as you describe. There are repeated dispatches over time for other cases though, maybe that's what you're thinking of: in the case of store-to-load forwarding you could see a lot of replays of the store over time if it depends on a missing load, or something. – BeeOnRope Jan 25 '21 at 02:40
  • I don't know of a good Q&A; the existing stuff is no doubt spread around fairly randomly on existing questions and in chats. I did actually do a deep dive trying to characterize exactly the replay behavior (e.g., how many dependent operations can replay, how long the horizon is for replay to occur), but the behavior was complex, so I didn't write anything detailed about it. – BeeOnRope Jan 25 '21 at 02:44
  • @BeeOnRope: I was assuming that load uops waiting for an address weren't different from other uops waiting for data, but you're saying they are. So only load uops aggressively replay in anticipation of an L3 hit or DRAM result, perhaps to try to reduce pointer-chasing latency by 1 cycle vs. just waiting for the data to arrive? So if you had a pointer-chasing loop with a load and 3 add-reg dependent on the pointer-chasing load, you'd expect one replay each of the adds at each step, but many replays of the load? Assuming L3 misses. – Peter Cordes Jan 25 '21 at 02:48
  • @PeterCordes - I'm probably not doing a good job of explaining myself. I don't think that dependent load ops are different from other ops in this sense. I don't think that dependent load ops retry multiple times (other than the 3 times implied by the L1 and L2 retries) in the load-load pointer-chasing case, just like other ops. I just mentioned pointer chasing because it's a case where there are (transitively) many ops waiting for each load, so you could get more than the "standard" 3x replays per miss. I don't recall if this actually happens though? Let me check. – BeeOnRope Jan 25 '21 at 04:26
  • Yeah, a pointer chasing benchmark has either 1, 2 or 3 uops to p23 for L1, L2 and L3/DRAM regions respectively. There is no continual replay and no additional replay associated with an L3 miss vs an L3 hit. This is from [uarch-bench `load-serial` tests](https://gist.github.com/travisdowns/9e17806be7167c0b9704fa4646687f97) and the `p23` uop counters. – BeeOnRope Jan 25 '21 at 04:34
  • Well I wrote that before the last two rows finished for 131 and 262 MB sized regions. These do approach 4 total uops to p23, rather than 3. However, the smaller tests like 64 MB are clearly out of L3 yet don't show it. I'm not sure what the effect is here, maybe related to page walks? **Update:** I believe it is related to page walks since if I disable THP you see ~4 for all sizes that don't fit in L3. – BeeOnRope Jan 25 '21 at 04:36
  • @BeeOnRope: Ok, I must have been remembering some other cause of replays that does keep replaying aggressively. It makes sense that dependent load chains aren't special: If a load doesn't have its address ready yet so it's still waiting in the RS, uops in turn dependent on it shouldn't be trying to issue yet. In pure pointer-chasing, there's only ever one load outstanding at once (dispatched successfully but result not produced yet). – Peter Cordes Jan 25 '21 at 04:36
  • @PeterCordes - there is a case like the one you are describing, IIRC it involves stores and store to load forwarding. Something like you see 1 p23 uop (or was it p4, I can't remember) for every cycle a store forwarding gets delayed (e.g., b/c the data isn't ready). – BeeOnRope Jan 25 '21 at 04:39
  • Probably in the above case the "extra" replay is caused by TLB miss. – BeeOnRope Jan 25 '21 at 04:41
  • @BeeOnRope: Interesting. My memory is of a dispatch attempt every 5 cycles or something like that, but that could have been based on a misunderstanding. – Peter Cordes Jan 25 '21 at 04:41
  • @BeeOnRope: Yeah, I was about to say the same thing; likely an L1dTLB miss produces an extra replay of dependent uops in anticipation of an L2TLB hit. The page walker itself doesn't do its loads via uops AFAIK (instead accessing L1d independently) and has unpredictable latency, so L2TLB miss seems a less likely source of replays. And getting exactly 1 extra replay for your test sizes would make sense for either L2TLB hits or full L2TLB misses that cause a walk. – Peter Cordes Jan 25 '21 at 04:43
  • _My memory is of a dispatch attempt every 5 cycles or something like that_ ... yes I do recall something along those lines as well, but IIRC it involved both loads and stores, or perhaps only loads but the fill buffers were full (e.g., it checked periodically for a free fill buffer). – BeeOnRope Jan 25 '21 at 04:54
  • From a link-only answer: https://www.youtube.com/watch?v=utRgthVxAYk&list=PLAwxTw4SYaPnhRXZ6wuHnnclMLfg_yjHs&index=59 "Load Store Queue Part 1 - Georgia Tech - HPCA: Part 3". IDK how relevant it actually is; haven't watched it. – Peter Cordes Jul 19 '22 at 01:24

1 Answer

  1. This is speculative execution where loads bypass older stores with unresolved addresses. When the older store's address resolves, we can detect a load violation. If the probability of address aliasing is low, this speculation is profitable (more throughput), which is typically true for real programs. On detecting a load violation, we can take the appropriate step: (a) store-to-load forward the store's data to the load, or (b) roll back the pipeline to the resolved store and re-execute from there.

  2. The same way as when loads are served by cache hits (which can take 1-3 cycles for an L1 hit). For example, in a design with reservation stations and a CDB (common data bus), the returning result is broadcast and picked up by every HW structure that needs it; see the sketch below.
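
A rough sketch of both points, under the same simplifications as in the question (single-word accesses, invented names; a real design also has to handle partial overlaps and the ISA's memory-ordering rules):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct LoadQueueEntry {
    bool     addr_ready      = false;
    uint64_t addr            = 0;      // address the load used
    bool     executed        = false;  // already delivered a value to dependents
    bool     waiting_on_miss = false;  // value requested from the cache, not returned yet
    uint16_t dest_preg       = 0;      // destination physical register tag
};

enum class Action { None, ForwardStoreData, FlushAndReplay };

// (1) An older store's address just resolved: search the younger loads.
//     A load that already *used* a possibly stale value must be squashed and replayed;
//     a load that is still waiting on its miss can instead be fed by store-to-load
//     forwarding, with the late cache fill dropped/ignored for that load.
Action on_store_address_resolved(std::vector<LoadQueueEntry>& lq,
                                 size_t first_younger_load, uint64_t store_addr) {
    for (size_t i = first_younger_load; i < lq.size(); ++i) {
        LoadQueueEntry& ld = lq[i];
        if (!ld.addr_ready || ld.addr != store_addr) continue;
        if (ld.executed)        return Action::FlushAndReplay;    // true load violation
        if (ld.waiting_on_miss) return Action::ForwardStoreData;  // store beat the fill
    }
    return Action::None;
}

// (2) The miss data came back: broadcast it, CDB-style, tagged with the destination
//     physical register. Everything waiting on that tag (reservation stations, the
//     register file, and the load-queue entry itself) snoops the bus and captures it.
struct CdbBroadcast { uint16_t dest_preg; uint64_t value; };

void on_cache_fill(std::vector<LoadQueueEntry>& lq, const CdbBroadcast& bcast) {
    for (LoadQueueEntry& ld : lq) {
        if (ld.waiting_on_miss && ld.dest_preg == bcast.dest_preg) {
            ld.waiting_on_miss = false;
            ld.executed        = true;   // the LQ now knows the value has arrived
        }
    }
    // Reservation stations and the physical register file capture bcast.value the same way.
}
```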

instinct71