I have listened and read to several articles, talks and stackoverflow questions about std::atomic
, and I would like to be sure that I have understood it well. Because I am still a bit confused with cache line writes visibility due to possible delays in MESI (or derived) cache coherency protocols, store buffers, invalidate queues, and so on.
I read x86 has a stronger memory model, and that if a cache invalidation is delayed x86 can revert started operations. But I am now interested only on what I should assume as a C++ programmer, independently of the platform.
[T1: thread1 T2: thread2 V1: shared atomic variable]
I understand that std::atomic guarantees that,
(1) No data races occur on a variable (thanks to exclusive access to the cache line).
(2) Depending which memory_order we use, it guarantees (with barriers) that sequential consistency happens (before a barrier, after a barrier or both).
(3) After an atomic write(V1) on T1, an atomic RMW(V1) on T2 will be coherent (its cache line will have been updated with the written value on T1).
But as cache coherency primer mention,
The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue)
So, is the following correct?
(4) std::atomic
does NOT guarantee that T2 won't read a 'stale' value on an atomic read(V) after an atomic write(V) on T1.
Questions if (4) is right: if the atomic write on T1 invalidates the cache line no matter the delay, why is T2 waiting for the invalidation to be effective when does an atomic RMW operation but not on an atomic read?
Questions if (4) is wrong: when can a thread read a 'stale' value and "it's visible" in the execution, then?
I appreciate your answers a lot
Update 1
So it seems I was wrong on (3) then. Imagine the following interleave, for an initial V1=0:
T1: W(1)
T2: R(0) M(++) W(1)
Even though T2's RMW is guaranteed to happen entirely after W(1) in this case, it can still read a 'stale' value (I was wrong). According to this, atomic doesn't guarantee full cache coherency, only sequential consistency.
Update 2
(5) Now imagine this example (x = y = 0 and are atomic):
T1: x = 1;
T2: y = 1;
T3: if (x==1 && y==0) print("msg");
according to what we've talked, seeing the "msg" displayed on screen wouldn't give us information beyond that T2 was executed after T1. So either of the following executions might have happened:
- T1 < T3 < T2
- T1 < T2 < T3 (where T3 sees x = 1 but not y = 1 yet)
is that right?
(6) If a thread can always read 'stale' values, what would happen if we took the typical "publish" scenario but instead of signaling that some data is ready, we do just the opposite (delete the data)?
T1: delete gameObjectPtr; is_enabled.store(false, std::memory_order_release);
T2: while (is_enabled.load(std::memory_order_acquire)) gameObjectPtr->doSomething();
where T2 would still be using a deleted ptr until sees that is_enabled is false.
(7) Also, the fact that threads may read 'stale' values means that a mutex cannot be implemented with just one lock-free atomic right? It would require a synch mechanism between threads. Would it require a lockable atomic?