
I think that, for the CPU to continue executing subsequent instructions, the store buffer must do part of the MESI processing to maintain cache consistency, because the latest value is held in the store buffer and not in the cache. So the store buffer sends read-invalidate or invalidate request messages and flushes the latest value to the cache after the ACK arrives.
The cache cannot do this.

Is my analysis right?
Or should all MESI processing be done by the cache?

Hadi Brais
Los Geles
  • The store buffer does not participate in cache coherence at all because it doesn't have to. Requests in the store buffer get sent to the L1 controller (or whatever hardware structure is in the coherence domain), and the L1 controller is the one that participates in coherence by requesting ownership of the target cache line. Any subsequent instructions executing on the same logical core will nonetheless use the result of the store even before the ownership request gets satisfied. This doesn't violate coherence because other cores cannot see the results of these instructions until the store retires. – Hadi Brais Apr 17 '18 at 17:22
  • I'm assuming that by "cache consistency" you're referring to cache coherence (formally, there is a distinction between them). – Hadi Brais Apr 17 '18 at 17:25
  • @HadiBrais: A CPU can optimize by sending RFOs early, so lines will become hot in L1d sooner and cache-miss stores aren't delayed so long, vs. if you just wait until a store is ready to commit from the store buffer to L1d. For example, one of Skylake's features is [L1 store misses generate L2 requests much earlier in Skylake than before](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Memory_subsystem). (Intel's optimization manual says that, too.) I'm not sure if that's what the OP is asking. – Peter Cordes Apr 17 '18 at 18:22
  • @LosGeles, why is this required to "continue executing subsequent instructions"? Any younger load would get the store data forwarded regardless of the store status (even before it commits) – Leeor Apr 17 '18 at 18:59
  • @PeterCordes Yes, but the critical point here is that the store cannot retire until the RFO has been granted and any related speculation has been resolved. This is a separate event from receiving the target cache line itself (the RFO can be granted earlier). As you said, the RFO can be sent as soon as the (physical) address of the cache line has been determined speculatively or otherwise. – Hadi Brais Apr 17 '18 at 19:26
  • The store can retire before the RFO has been granted - but it can't _commit_ (leave the store buffer and become visible at the coherence point) until that happens. Stores can stay in the store buffer after retirement. – BeeOnRope Apr 18 '18 at 00:28
  • @BeeOnRope Usually "retire" and "commit" refer to the same thing--release the resources used by the instruction and make its effects globally visible. Intel always uses the term "retire" in their manuals. I mostly use that term too. What distinction are you making between retire and commit? – Hadi Brais Apr 18 '18 at 17:09
  • @HadiBrais - right, _commit_ might not be the right word for the other concept, but I'll use it anyway for lack of a better one for now. The distinction is between _instruction retirement_ and _store commit_. The former is the usual thing everyone seems to call retirement or occasionally commit or sometimes _graduation_ and applies to every instruction: an instruction retires, usually in-order, on an out-of-order design, when all earlier instructions have retired, and at this moment the instruction becomes non-speculative. I hesitate to say _globally visible_ since what does that mean ... – BeeOnRope Apr 18 '18 at 18:56
  • ... for any type of instruction other than a store? The other concept is store-specific, which I'll call "store commit", and it's the moment the store leaves the store buffer and enters the cache subsystem. AFAIK on some/many designs this is sometimes/usually/always after retirement. So once a store has _retired_, it is non-speculative and will always _commit_ eventually (unless someone yanks the power), but has not necessarily become visible to any other CPU yet. It may take a long time to drain from the store buffer. Even if an interrupt or fault occurs, it must still drain. – BeeOnRope Apr 18 '18 at 18:59
  • @HadiBrais - we disagree about retire, I think it has an established meaning which has nothing to do specifically with stores (but still _applies_ to stores since they are instructions too). So I don't think there is any confusion about the meaning of retire (you agree with my definition above as it applies to all instructions, right?). So AFAICT the only confusion is what to call the second part of what a store does. Apart from naming, the distinction is only rarely interesting at the software level. – BeeOnRope Apr 18 '18 at 19:32
  • One example: something like an exception, fault or interrupt that throws away all un-retired instructions _still_ cannot throw away retired-but-not-yet-committed stores, which can have implications for how long those operations take. Other examples are more philosophical: that the visible total store order might be quite different from the order in which stores _retired_ across CPUs - but this is mostly invisible to software since stores are the primary way to establish order in software between CPUs, so the retirement order isn't material. – BeeOnRope Apr 18 '18 at 19:32
  • It could be observed though via some external coordination mechanism, e.g., signals on external pins or other activity which doesn't go through the store buffer. It could also be observed if CPUs have closely synchronized clocks and polled them before and after stores: you could observe that based on the clock and its known tolerance, store A definitely came before store B, but external observers could see the TSO as B coming before A. – BeeOnRope Apr 18 '18 at 19:37
  • @BeeOnRope My understanding of retiring a store instruction is that two things need to be done: (1) release the resources it occupies, and (2) create a store request and place it somewhere to be sent to the first coherence point, whatever that is. These two actions might be done in some order or concurrently. Once they are done, the store has "retired". So I can imagine that there is some "dark area" between the pipeline and the coherence point where everything is ordered according to consistency and coherency... – Hadi Brais Apr 18 '18 at 20:12
  • ...The store request cannot "reach" the coherence point until it's been granted ownership. If an exception or interrupt occurs, nothing special needs to happen. The in-flight store request doesn't have to be deleted. Only the instructions that have not completed the two actions mentioned above may need to be flushed. I realize now I'm saying something potentially more precise than before. I said before "the store cannot retire until the RFO has been granted"... – Hadi Brais Apr 18 '18 at 20:12
  • That sounds correct, yeah. The problem was probably only with the use of the word _retire_, which generally has a specific meaning for out-of-order processors, and many do a lot of work for stores (the stuff you mention above) _after_ retirement. Perhaps some people include all that in the concept of retirement, for stores only, but that IMO would be very confusing since retire really has a precise technical meaning. – BeeOnRope Apr 18 '18 at 20:14
  • @BeeOnRope Generally in textbooks and papers when they say "the store has retired" they imply that it has reached the coherence domain as well, *as if there is nothing in between*. This sounds OK since that bit of detail is mostly not important. It's a convenient simplification I guess, and the detail only matters when you're the one who has to design that part of the processor. – Hadi Brais Apr 18 '18 at 20:25
  • @HadiBrais - right. It could also be that _retire_ is used in different ways when talking about caching, coherence and load/store machinery, as compared to talking about out-of-order processing in modern CPUs. – BeeOnRope Apr 18 '18 at 20:31

1 Answer


On most designs the store buffer wouldn't directly send invalidate requests and is usually not even snooped¹ by external requests. That is, it is part of the private/core-side of the coherence domain and so doesn't need to participate in coherence. Instead, the store buffer ultimately interacts with the first level of the caching subsystem which itself would be responsible for the various parts of the MESI protocol.
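To make that division of labor concrete, here is a minimal toy sketch in Python. Everything in it (`Bus`, `L1Controller`, `Core`, `drain_one`, the snoop method names) is hypothetical and made up for illustration; it is not a model of any real CPU. It assumes a two-core system, one value per cache line, and a simplified MESI in which read misses always install the line as Shared. The point is only that the store buffer is private (it forwards to loads on its own core) while all coherence traffic is issued by the L1 controller.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

class Bus:
    """Hypothetical interconnect: backing memory plus all L1 controllers."""
    def __init__(self):
        self.memory = {}
        self.caches = []

class L1Controller:
    """The only component in this sketch that speaks (simplified) MESI."""
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}  # addr -> [State, value]
        bus.caches.append(self)

    def snoop_invalidate(self, addr):
        # Another cache sent an RFO: write back if dirty, then invalidate.
        line = self.lines.get(addr)
        if line and line[0] is not State.I:
            if line[0] is State.M:
                self.bus.memory[addr] = line[1]
            line[0] = State.I

    def snoop_read(self, addr):
        # Another cache wants to read: write back if dirty, downgrade to S.
        line = self.lines.get(addr)
        if line and line[0] in (State.M, State.E):
            if line[0] is State.M:
                self.bus.memory[addr] = line[1]
            line[0] = State.S

    def read(self, addr):
        line = self.lines.get(addr)
        if line and line[0] is not State.I:
            return line[1]  # hit
        for other in self.bus.caches:
            if other is not self:
                other.snoop_read(addr)
        value = self.bus.memory.get(addr, 0)
        self.lines[addr] = [State.S, value]  # simplification: never E
        return value

    def write(self, addr, value):
        line = self.lines.get(addr)
        if not line or line[0] not in (State.M, State.E):
            # Read For Ownership: invalidate every other copy first.
            for other in self.bus.caches:
                if other is not self:
                    other.snoop_invalidate(addr)
        self.lines[addr] = [State.M, value]

class Core:
    def __init__(self, bus):
        self.l1 = L1Controller(bus)
        self.store_buffer = []  # (addr, value), oldest first; never snooped

    def store(self, addr, value):
        self.store_buffer.append((addr, value))  # no coherence traffic yet

    def load(self, addr):
        # Store-to-load forwarding: youngest matching buffered store wins.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.l1.read(addr)

    def drain_one(self):
        # "Store commit": the oldest store leaves the buffer and enters L1.
        addr, value = self.store_buffer.pop(0)
        self.l1.write(addr, value)

bus = Bus()
c0, c1 = Core(bus), Core(bus)
c1.load(0x40)          # c1 caches the line (value 0) in Shared state
c0.store(0x40, 7)      # sits in c0's private store buffer
print(c0.load(0x40))   # 7: forwarded from c0's own store buffer
print(c1.load(0x40))   # 0: c1 still hits its own valid copy
c0.drain_one()         # L1 controller performs the RFO, then writes
print(c1.load(0x40))   # 7: c1's copy was invalidated, so it re-fetches
```

Note that nothing in `Core.store` touches the bus: the other core can only ever observe the value after `drain_one` hands the store to the L1 controller.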

How that interaction works exactly depends on the design, of course. A simple design may only process one store at a time: it takes the oldest store, at the head of the store buffer, performs the Read For Ownership (RFO) for that address, and when that completes moves on to the next element. A more sophisticated design might send RFOs for several "upcoming" requests in the store buffer in an attempt to exploit more memory-level parallelism. The exact mechanism isn't clear to me on x86: stores to L2 seem to perform quite poorly in some scenarios, but I'm pretty sure a bunch of store misses to RAM will perform much better than if they were handled serially.
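As a rough illustration of why overlapping RFOs helps, here is a toy cycle-count sketch in Python. The function `drain_cycles` and its parameters are hypothetical, and the numbers are invented: it assumes every store misses to a distinct line, every RFO takes a fixed latency, stores commit strictly in order, and a commit takes one cycle. Real hardware is far more complicated, so treat this as a back-of-the-envelope model only.

```python
def drain_cycles(num_stores, rfo_latency, window):
    """Cycles needed to commit num_stores cache-miss stores in order.

    window is how many entries, counted from the head of the store
    buffer, may have an RFO in flight at once (1 = fully serial design).
    """
    ready_at = {}  # store index -> cycle at which its RFO completes
    cycle = 0
    head = 0
    while head < num_stores:
        # Issue an RFO for each entry in the window that has none yet.
        for i in range(head, min(head + window, num_stores)):
            ready_at.setdefault(i, cycle + rfo_latency)
        # Commit the head store if its line is now owned.
        if ready_at[head] <= cycle:
            head += 1
        cycle += 1
    return cycle

print(drain_cycles(4, 100, window=1))  # 404: each RFO waits for the last commit
print(drain_cycles(4, 100, window=4))  # 104: the four RFOs overlap
```

With a window of 1 the cost is roughly the RFO latency multiplied by the number of stores; with a wider window the latencies overlap and only the in-order commits remain serialized.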


¹ There are some exceptions: e.g., simultaneous multithreading (hyperthreading on x86), which involves two logical cores sharing all levels of cache and hence being unable to avail themselves of the normal cache coherency mechanisms, may require store buffer snoops.

BeeOnRope