
When it comes to implementing a CAS loop using std::atomic, cppreference gives the following example for push:

template<typename T>
class stack
{
    std::atomic<node<T>*> head;
 public:
    void push(const T& data)
    {
      node<T>* new_node = new node<T>(data);
      new_node->next = head.load(std::memory_order_relaxed);

      while(!head.compare_exchange_weak(new_node->next, new_node,
                                        std::memory_order_release,
                                        std::memory_order_relaxed /* Eh? */));
    }
};

Now, I don't understand why std::memory_order_relaxed is used for the failure case. As far as I understand, compare_exchange_weak (same for -strong, but I'll use the weak version for convenience) is a load operation on failure, which means it may load the value stored by a successful CAS in another thread with std::memory_order_release, and thus it should use std::memory_order_acquire to be synchronized-with that store instead...?

while(!head.compare_exchange_weak(new_node->next, new_node,
                                  std::memory_order_release,
                                  std::memory_order_acquire /* There you go! */));

What if, hypothetically, the 'relaxed load' keeps getting one of the old values, so the CAS fails again and again and stays in the loop for extra time?

The following scratchy picture shows where my brain is stuck.

[image: hand-drawn diagram of threads T1 and T2 operating on the atomic head]

Shouldn't a store from T2 be visible at T1? (by having synchronized-with relation with each other)

So to sum up my question,

  • Why not std::memory_order_acquire, instead of std::memory_order_relaxed at failure?
  • What makes std::memory_order_relaxed sufficient?
  • Does std::memory_order_relaxed at failure mean (potentially) more looping?
  • Likewise, does std::memory_order_acquire at failure mean (potentially) less looping? (aside from its performance cost)
curiousguy
Dean Seo
  • The memory order is within one thread. The other thread is unaware of the CAS in other threads. – Bruce Shen Oct 31 '19 at 03:11
  • Here is [an example](https://stackoverflow.com/questions/45772887/real-world-example-where-stdatomiccompare-exchange-used-with-two-memory-orde) where the second ordering parameter cannot be `mo_relaxed` – LWimsey Oct 31 '19 at 08:02
  • Memory orders are all about *other* memory locations; they have no effect or significance for the atomic variable you're accessing. – Cubbi Oct 31 '19 at 17:32
  • Is your Q somewhat C++11 specific? That is, are you uninterested in any A relating to the current C++ std? – curiousguy Nov 01 '19 at 00:05
  • 1) I removed the version-specific tag as I don't believe you want an answer for only that version of the C++ spec. 2) Your image renders really small and the text isn't very readable. – curiousguy Nov 01 '19 at 06:47
  • @curiousguy: It's asking about C++ language features introduced in C++11. But yeah, the [tag:stdatomic] already covers the features being asked about so we don't need C++11. – Peter Cordes Nov 01 '19 at 07:28
  • @curiousguy I added `c++11` since `std::atomic` was introduced in that version. It doesn't have to be specifically within C++11, though, and I doubt the meaning of the question varies depending on which version we're referring to. *"Your image is rendered really small and the text isn't very readable"* Is this still a problem? I've checked it out with my other platforms and the image looks fine to me. – Dean Seo Nov 02 '19 at 00:36
  • @LWimsey Thanks. That example gave me a great insight. – Dean Seo Nov 02 '19 at 00:36
  • @Cubbi *"Memory orders are all about other memory locations"*, and **including the atomic variable itself** you're accessing when it's of a rel-acq relation, no? – Dean Seo Nov 02 '19 at 00:38
  • @DeanSeo The [tag:stdatomic] tag implies "a C++ version that supports `std::atomic`". So the version tag seems redundant, and then it consumes one tag among the four available (since [tag:c++] is mandatory). People interested in particular topics can watch tags so more precise tags is better; I doubt many ppl are following a particular C++ version who are not following [tag:c++]. – curiousguy Nov 02 '19 at 00:40
  • @DeanSeo All accesses of an atomic object are inherently ordered, such that reads are after one atomic store, or the initialization, and modifications (stores or RMW = read-modify-write) are in some order, and the read part of the RMW reads the value just written by the previous modification. No special memory visibility is needed to get that, **as it's the absolute minimum expectation for atomics to be usable.** – curiousguy Nov 02 '19 at 00:44
  • Btw sorry for the late responses. I was ill the whole day. I read all your comments and they gave me great insight on this. – Dean Seo Nov 02 '19 at 00:46
  • @curiousguy Right, I think the current tags as you edited make more sense too. – Dean Seo Nov 02 '19 at 00:47

2 Answers


Stricter memory orders are used to prevent data races; they shouldn't be expected to improve performance in an already correct program. In the example you provided, replacing memory_order_relaxed with memory_order_acquire wouldn't fix any data race and could only decrease performance.

Why is there no data race? Because the while loop works only with a single atomic variable, and operations on a single atomic are always data-race-free regardless of the memory order used.

Why, then, is memory_order_release used in the success case? It is not shown in the example, but it is assumed that readers of head use memory_order_acquire, e.g.:

T* stack::top() {
  auto h = head.load(std::memory_order_acquire);
  return h ? &h->value() : nullptr;
}

This release-acquire pairing creates a synchronized-with relationship between the release of a new head and its acquisition by another thread.

Thread A                      Thread B
st.push(42);
                              if (auto value = st.top()) {
                                assert(*value == 42);
                              }

In the above example, without the release-acquire pairing (i.e. if memory_order_relaxed were used instead), the assertion could fail because Thread B could see an incompletely initialized node that head already points to (the compiler could even reorder the node constructor call to after the store to head in push()). In other words, there would be a data race.

dened
  • `compare_exchange_strong` wouldn't avoid the need to loop here, so it's not better. If another thread won the race, cas_strong would still fail, and we'd still need to use the now-updated `new_node->next` for another CAS attempt. (Otherwise we could just `head.store(release)` instead of CAS at all.) CAS_strong itself requires a loop on an LL/SC machine (like ARM), so a CAS_strong retry loop would be a nested loop. CAS_weak lets the compiler treat spurious failures the same as real failures with a single asm loop; that's the whole point of its existence. – Peter Cordes Feb 05 '22 at 11:49
  • `compare_exchange_strong` should be used when you need to make exactly one true CAS attempt, and *don't* want to retry right away if it fails, or do some visible side-effect for each failure. So in the opposite case of what you suggested. (Otherwise good answer, useful approach to explaining relaxed there.) – Peter Cordes Feb 05 '22 at 11:50
  • @PeterCordes I didn't say that we wouldn't need a loop with `compare_exchange_strong`. What you said about the nested loop makes sense, but I can still imagine that on some platforms `compare_exchange_strong` "internal looping" might be a lot more efficient than a normal loop. And if not, then `compare_exchange_strong` is probably implemented as a normal loop and a decent compiler should optimize out the external one. – dened Feb 05 '22 at 12:37
  • Ah, I see your misconception now. On platforms where CAS_strong is cheap, CAS_weak compiles the same as it and doesn't have spurious failures (e.g. x86, or ARMv8.1). Compilers don't optimize atomics, so no, they don't do that loop transformation on platforms where CAS_strong is expensive. https://godbolt.org/z/zjrffYKGK shows x86 and ARMv8.1 compiling the same both ways, and 32-bit ARM compiling with a nested loop. (ARMv8.0 is for some reason a lot messier than it needs to be with clang, or GCC calls `__aarch64_cas4_acq_rel`, so unfortunately a same-ISA comparison wasn't helpful.) – Peter Cordes Feb 05 '22 at 12:54
  • I removed the P.S. from the answer since it is a bit controversial. – dened Feb 05 '22 at 12:55
  • The mechanism for CAS_strong as a single hardware instruction isn't "internal looping", it's the hardware locking down that cache line against MESI share or invalidate from the internal load until the internal store. Rather than just finding out on the store attempt that something else had invalidated, like you do with [LL/SC atomics](https://en.wikipedia.org/wiki/Load-link/store-conditional). – Peter Cordes Feb 05 '22 at 12:58

I don't understand how come std::memory_order_relaxed is used for the failure

And I don't understand how you can complain about the lack of acquire semantics on that failure branch, yet not complain about

head.load(std::memory_order_relaxed);

and then

while(!head.compare_exchange_weak(new_node->next, new_node,
                                  std::memory_order_release

neither of which has acquire semantics "to be synchronized-with" some other operation that you don't show us. What is that other operation you care about?

If that operation is important, show the operation and tell us how this code depends on the "publication" (or "I'm done" signal) performed by that other operation.

Answer: the push function in no way depends on the publication of any "I'm done" signal by another function, as push does not use other published data, does not read other pushed elements, etc.

Why not std::memory_order_acquire, instead of std::memory_order_relaxed at failure?

To acquire what? In other words, to observe what accomplishment?

Does std::memory_order_relaxed at failure mean (potentially) more looping?

No. Whether the CAS fails has nothing to do with the memory order; memory visibility is a function of the mechanism of the CPU cache.

EDIT:

I just saw the text in your image:

Shouldn't a store from T2 be visible at T1? (by having synchronized-with relation with each other)

Actually, you misunderstood synchronized-with: it doesn't govern the propagation of the value of the atomic variable that is being read, as an atomic is by definition a primitive that is usable under concurrent access. A read of an atomic always returns a value of the atomic variable as written by some other thread (or the same thread). If that weren't the case, then no atomic operation would be meaningful.

No memory ordering is ever needed to read a single atomic variable.

curiousguy
  • The "failure" memory order is for the pure load of the current value which updates the `expected` arg (taken by reference). IDK what you mean by saying it's anything to do with the CPU cache. – Peter Cordes Nov 01 '19 at 06:46
  • @PeterCordes Since the expected value (`new_node->next = head.load(std::memory_order_relaxed);`) was first read w/o an acquire, I assume the updated value doesn't need one either. – curiousguy Nov 01 '19 at 06:50
  • Then the *success* memory order is a release order, to let other functions (such as `pop`) utilize `head` by establishing the synchronizes-with relation? – Dean Seo Nov 02 '19 at 00:43
  • @DeanSeo Yes, some reading function, somewhere, needs to be able to import data, that is, to observe the "accomplishment" of another thread: one thread A pushes its accomplishment, which is a "I'm done" event (release) and the other B pulls it ("are you done" test). **That puts the "I'm done" of A in the past of B.** (In general stuff happening in other threads are not automatically in the past or future.) – curiousguy Nov 02 '19 at 01:23
  • Looking at this again, when you wrote "*it's a function of the mechanism of the CPU cache.* - did you mean "**that's** a function ...", to say that memory visibility is guaranteed by HW CPU cache, rather than that "it" (the failure mem order) is something to do with CPU cache? – Peter Cordes Feb 05 '22 at 11:54