
In the MESI protocol, when a CPU:

  • Performs a read operation
  • Finds out the cache line is in Invalid state
  • There are no non-invalid copies in other caches

It will need to fetch the data from memory, which takes a certain number of cycles. So does the state of the cache line change from (I) to (E) instantly, or only after the data is fetched from memory?


2 Answers


I think a cache would normally wait for the data to arrive; when it's not there yet you can't actually get a hit in cache for other requests to the same line, only to other lines that actually are present (hit under miss). Therefore the state for that line is still Invalid; the data for that tag isn't valid, so you can't set it to a valid state yet.

You'd want another miss to the same line (miss under miss) to notice there was already an outstanding request for that line and attach itself to that line-request buffer (e.g. Intel x86 LFB = line fill buffer). Since finding Invalid triggers looking at the fill buffers but finding Exclusive doesn't, you want Invalid based on this reasoning as well.
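For what it's worth, here's a toy software sketch of that lookup path (all the names -- MesiState, LineFillBuffer, lookupLoad -- are invented for illustration, not real hardware structures or a real simulator API):

    // Toy model: a load that finds the line Invalid checks the outstanding
    // fill buffers; a miss-under-miss to the same line attaches to the
    // existing fill instead of issuing a second memory request.
    #include <cstdint>
    #include <vector>

    enum class MesiState { Modified, Exclusive, Shared, Invalid };

    struct LineFillBuffer {          // roughly an LFB / MSHR slot
        uint64_t lineAddr;           // which line is in flight
        bool dataReady = false;      // data not back from memory yet
        std::vector<int> waiters;    // loads waiting on this line
    };

    struct CacheLine {
        uint64_t tag = 0;
        MesiState state = MesiState::Invalid;  // stays Invalid until the fill completes
    };

    struct L1Cache {
        std::vector<CacheLine> lines;
        std::vector<LineFillBuffer> fillBuffers;

        // Returns true if the load can complete now (hit), false if it must wait.
        bool lookupLoad(uint64_t lineAddr, int loadId) {
            for (auto& l : lines)
                if (l.tag == lineAddr && l.state != MesiState::Invalid)
                    return true;                        // plain hit (or hit-under-miss)
            for (auto& fb : fillBuffers)
                if (fb.lineAddr == lineAddr) {          // miss-under-miss to the same line:
                    fb.waiters.push_back(loadId);       // attach to the existing request
                    return false;
                }
            fillBuffers.push_back({lineAddr, false, {loadId}});  // fresh miss: allocate a
            return false;                                        // buffer and request memory
        }
    };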

e.g. the Skylake perf-counter event mem_load_retired.fb_hit counts exactly this case; from perf list output:

[Retired load instructions which data sources were load missed L1 but hit FB due to preceding miss to the same cache line with data not ready.
Supports address when precise (Precise event)]

In a very old / simple or toy CPU with no memory-level parallelism (where the whole pipeline, or at least memory access, totally stalls execution until the data arrives), the distinction is meaningless; nothing else happens to the cache while the requested data is in flight.

In such a CPU it's just an implementation detail. (Except that it should still process MESI requests from other cores while a load is in flight, so again the tags need to reflect the correct state; otherwise it's extra stuff to check when deciding how to reply.)

Peter Cordes

After data is fetched from memory.

In practice, MESI (or any other protocol) has many transition states in addition to the main M/E/S/I states. In your example, the coherence protocol would move to a "Wait for Data Fill" state and would transition to E only after the data is fetched and the valid bit is set.
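Here's a toy sketch of that idea (loosely in the spirit of the gem5 example protocols referenced below, but not actual SLICC code; the state and event names are made up for illustration):

    #include <stdexcept>

    enum class CohState {
        I,        // Invalid
        I_to_E,   // transient "Wait for Data Fill" state: fetch issued, data not yet valid
        E,        // Exclusive
        // M, S and their transient states omitted
    };

    enum class CohEvent { LoadMiss, DataFillArrives };

    CohState transition(CohState s, CohEvent e) {
        switch (s) {
        case CohState::I:
            if (e == CohEvent::LoadMiss) return CohState::I_to_E;   // issue fetch to memory
            break;
        case CohState::I_to_E:
            if (e == CohEvent::DataFillArrives) return CohState::E; // valid bit set -> Exclusive
            break;
        default:
            break;
        }
        throw std::logic_error("state/event pair not modeled in this sketch");
    }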

Reference: cache coherence protocols in gem5/Ruby -- http://learning.gem5.org/book/part3/MSI/cache-transitions.html (search for "was invalid, going to shared").

instinct71
  • Can you link a reference about real CPU designs actually having these other temporary states, and using extra bits to record it? It makes sense that you'd have a state that means "look for an existing fill buffer, the line is on its way", but I'm mostly a software guy (interested in architecture for performance reasons) and haven't heard of it. – Peter Cordes Nov 18 '19 at 21:05
  • I learnt about such transition states while modifying cache coherence protocols in gem5/opal. For example: http://learning.gem5.org/book/part3/MSI/cache-transitions.html (search for "was invalid, going to shared") may be useful. Please let me know if I should add this link to my main answer. Thanks. – instinct71 Nov 18 '19 at 21:09
  • Yeah, worth moving that comment + link into your answer. It's not definitive that real hardware would work that way, e.g. instead of bits in each cache tag to indicate which line-fill buffer is already allocated, a load that finds "Invalid" might just probe all the fill buffers to see if one of them is already waiting for that line. Especially if checking all the fill buffers is something that ever needs to happen for another reason, so the parallel comparator / select hardware would already need to exist. (e.g. for x86 NT loads from WC memory where data only sits in LFBs, not cache) – Peter Cordes Nov 18 '19 at 21:26
  • Your intuition is absolutely correct. In addition to having a valid bit for each cache line (see lecture slide 57: https://safari.ethz.ch/architecture/fall2018/lib/exe/fetch.php?media=onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pdf), caches have MSHRs (Miss Status Handling Registers). The job of MSHRs (same notes, slide 113) is to coalesce requests to a cache line to avoid redundant memory requests and consequent transfers. – instinct71 Nov 18 '19 at 22:07
  • Ok, so the actual MESIF / MOESI status bits probably stay Invalid until data arrives, and the MSHR tracks the status of the transition? With 3 bits per tag for MESIF instead of just 2 for MESI I guess you'd have room to encode some other states if you wanted to do it in the cache tags themselves. – Peter Cordes Nov 18 '19 at 22:15
  • The coherence protocol state machine (SM) and the coalescing mechanism are independent. From the POV of a cache line, MSHRs are the gateway to memory. When a cache miss happens, (1) a memory request is issued to the MSHR logic (which can relay it to memory if necessary), and (2) the coherence SM transitions from Invalid to "wait for fill, then E". When the memory request is completed, the MSHRs update the valid bit, which in turn triggers the coherence SM's transition to Exclusive. Adding intermediate transitions does increase tag overhead, but only minimally (tag is ~32b, data is 64B). (See the sketch after these comments.) – instinct71 Nov 18 '19 at 22:51
  • It's not obvious what benefit there is to having the actual tags record "wait for fill, then E" instead of still "Invalid". Other accesses would still go to the MSHRs and find there was already one waiting for this load. Unless the tag records *which* MSHR got allocated to track that incoming line? Or does the outstanding load affect how you reply to other MESI requests? Like would an invalidate have to squash the incoming load? I'm used to thinking about Intel CPUs pre-SKX with a shared *inclusive* L3 whose tags act as a snoop filter for on-chip accesses so my general intuition may be off – Peter Cordes Nov 18 '19 at 23:00
  • The reason for intermediate states (pg. 387 of Parallel Computer Architecture by Culler/Singh) is the non-atomic nature of the bus/network. The book has a write-up with a picture -- much better than what I would be able to write. The Amazon preview at section 6.2.5 (https://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433) should be enough. – instinct71 Nov 18 '19 at 23:41
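The sketch referenced in the comment above, showing the MSHR/coherence-SM interaction it describes (purely illustrative; MshrEntry, onLoadMiss, onFillComplete and the transient-state name are invented, not a real design or a real simulator's API):

    #include <cstdint>
    #include <functional>
    #include <vector>

    enum class CohState { I, WaitFillThenE, E };

    struct MshrEntry {
        uint64_t lineAddr;
        std::vector<std::function<void()>> waiters;  // coalesced loads to replay on fill
    };

    struct CacheController {
        CohState state = CohState::I;
        bool valid = false;   // valid bit for the line being filled

        // (1) a miss allocates an MSHR entry; (2) the coherence SM enters
        // the transient "wait for fill, then E" state.
        MshrEntry onLoadMiss(uint64_t lineAddr) {
            state = CohState::WaitFillThenE;
            return MshrEntry{lineAddr, {}};
        }

        // When memory responds, the MSHR logic sets the valid bit; that in
        // turn drives the coherence SM to Exclusive and replays any waiters.
        void onFillComplete(MshrEntry& m) {
            valid = true;
            state = CohState::E;
            for (auto& replay : m.waiters) replay();
        }
    };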