
If I have a cache line of data and the first byte is being atomically modified, can I still read different bytes of data from this cache line concurrently? Or will my attempt to read know about the atomic update taking place and wait for it?

I am trying to understand the performance implications of the above scenario.

user997112

1 Answer


Cache coherence is maintained at line granularity, usually 64B on most CPUs these days. The core doing the modification will first request ownership of the entire line, which means that all other cores must invalidate their copies (if they had any). Any other core attempting a read will have to request that line, which will result in a snoop being sent to the modifying core. From there you have two options:

  1. Either the modifying core has completed the read-modify-write sequence, and the line sits in its cache with the latest modified data - in this case the snoop will initiate a write-back (WB) sequence, the updated line will become available to all, and the 2nd core may read any byte from it.

  2. The modifying core acquired the line through the load, but its store has not yet made the change (which is possible since stores are usually performed much later in the pipe, while loads are often done speculatively). In this case, the core must protect the line from being snooped, usually by implementing an internal lock for such operations. Note that on x86, for example, most atomic read-modify-write operations require a lock prefix. Also note that a normal (non-atomic) sequence of read + write would simply lose the line at that point, and acquire it again for the store later, thereby losing coherence.
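To get a feel for the scenario from the question, here is a minimal C++ sketch: one thread atomically increments the first byte of a line while another thread reads a different byte of the same line. The names `SharedLine` and `run_demo` are made up for illustration, and the 64-byte line size is an assumption about the target CPU. It only shows that the concurrent read is legal and returns correct data; it does not measure the cost of the line bouncing between cores.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical layout: the atomic byte and the payload share one cache line,
// ASSUMING 64-byte lines (typical, but not guaranteed on every CPU).
struct alignas(64) SharedLine {
    std::atomic<std::uint8_t> counter{0};  // first byte, atomically modified
    std::uint8_t payload[63]{};            // remaining bytes of the same line
};
static_assert(sizeof(SharedLine) == 64, "layout should fill exactly one assumed line");

// Runs a writer doing atomic read-modify-write operations on the first byte
// and a reader loading another byte of the same line.
// Returns the sum of the values the reader observed.
std::uint64_t run_demo(SharedLine& line, int iterations) {
    std::uint64_t sum = 0;
    std::thread writer([&line, iterations] {
        for (int i = 0; i < iterations; ++i) {
            // On x86 this typically compiles to a lock-prefixed instruction.
            line.counter.fetch_add(1, std::memory_order_relaxed);
        }
    });
    std::thread reader([&line, &sum, iterations] {
        // payload is not written after setup, so these plain reads do not race.
        // The volatile access only keeps the compiler from hoisting the load.
        const volatile std::uint8_t* p = &line.payload[8];
        for (int i = 0; i < iterations; ++i) {
            sum += *p;
        }
    });
    writer.join();
    reader.join();
    return sum;
}
```

Timing `run_demo` against a variant where the two fields sit on separate lines would be one way to see the performance implication on a given machine.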

Edit: following Paul's comment, it's indeed possible to design a cache system that allows sub-line granularity tracking. This basically means decoupling the basic block of the MESI protocol from the basic block size used for caching. You would need to add state bits for each subset (but could still use a single tag for all), invalidate only local subsets, and eventually do a merge somehow to regain the full line. However, the overhead would make it quite rare, and I'm not familiar with commercial CPUs doing all this just to avoid false sharing. Either way, since such a sub-block would probably not be byte-sized, the original question still applies to bytes within the same block.
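Since hardware generally won't separate the bytes for you, the usual software-side workaround for false sharing is to place the hot atomic byte and the other data on different lines. A minimal sketch, where `kAssumedLineSize`, `PaddedFields` and `same_assumed_line` are hypothetical names and the 64-byte line size is again an assumption:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: 64-byte cache lines; adjust for the actual target CPU.
constexpr std::size_t kAssumedLineSize = 64;

// Hypothetical layout: each field starts on its own assumed cache line,
// so atomic updates to counter do not invalidate the line holding payload.
struct PaddedFields {
    alignas(kAssumedLineSize) std::atomic<std::uint8_t> counter{0};
    alignas(kAssumedLineSize) std::uint8_t payload[63]{};
};

// Returns true if both addresses fall into the same assumed cache line.
inline bool same_assumed_line(const void* a, const void* b) {
    auto ua = reinterpret_cast<std::uintptr_t>(a);
    auto ub = reinterpret_cast<std::uintptr_t>(b);
    return ua / kAssumedLineSize == ub / kAssumedLineSize;
}
```

The trade-off is memory: the struct grows from one assumed line to two in exchange for reads of `payload` no longer competing with the atomic updates.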

Leeor
  • "losing coherence" should probably be "losing atomicity with respect to other readers/writers" (i.e., the non-LOCK x86 read-modify-write instructions are only atomic with respect to interrupts). Also, there is nothing inherently preventing an implementation from tracking validity at finer granularity *or* speculating that potentially stale data will not be changed. Of course, such fine points of *potential* implementations are not essential to answering the question (until someone decides to actually implement such features). –  Sep 07 '15 at 19:30