11

I'm trying to figure out the most relaxed (and correct) memory order for shared pointer destructor. What I have in mind for now is as follows:

~shared_ptr() {
  if (p) {
    if (p->cnt.fetch_sub(1, std::memory_order_release) == 1) {
      p->cnt.load(std::memory_order_acquire);
      delete p;
    }
  }
}

Basically, I'm thinking that all previous fetch_sub() should happen-before delete p;, and by p->cnt.load(std::memory_order_acquire);, I construct a release sequence that ensures this.

I'm new to the C++ memory model and not quite confident. Is my reasoning above correct, and is the memory order I specified both correct and the most relaxed?

Lingxi
  • Shouldn't you be using `atomic_thread_fence` instead of a call to `p->cnt.load(std::memory_order_acquire);` to force the ordering? – keith Mar 05 '18 at 14:34
  • @keith Will this effectively make the `fetch_sub` `std::memory_order_acq_rel`? – Lingxi Mar 05 '18 at 14:43
  • Is that a complete `std::shared_ptr`? You destroy the control block? – curiousguy Mar 07 '18 at 05:02
  • @curiousguy The `delete p` destroys the control block. The complete code is [here](https://github.com/Lingxi-Li/lock_free/blob/master/lf/shared_ptr.hpp#L51). – Lingxi Mar 07 '18 at 05:11
  • OK but I see no room for weak ptr (a strong ref to the control block that is not a strong ref to the data) here so not std conformant – curiousguy Mar 07 '18 at 05:36
  • 1
    The name is misleading. The code is more suggestive of an intrusive ref-counted pointer such as the one from [Boost](http://www.boost.org/doc/libs/1_61_0/libs/smart_ptr/intrusive_ptr.html). These don't support weak pointers. – Arne Vogel Mar 07 '18 at 12:53
  • @keith That's the way I've done it in the past, but it seems to me now that the acquire load is sufficient because (coming after the `fetch_sub` to zero on the same thread), it is guaranteed to see all updates from threads that have previously decremented the counter to non-zero, and prevents any following reads and writes (from `delete`) from being reordered before itself and therefore the `fetch_sub`. (BTW, for completeness, the `fetch_add` on copying the pointer can be relaxed.) – Arne Vogel Mar 07 '18 at 12:58

1 Answer

10

In theory, you may have the most efficient code, since there is no more synchronization than necessary.

But in practice, almost no CPU provides instructions that map perfectly to the acquire/release memory orders (maybe the future ARMv8.3-A will). So you will have to check the generated code for each target.

For example, on x86_64, fetch_sub(std::memory_order_acq_rel) and fetch_sub(std::memory_order_release) result in exactly the same instruction.

So while in theory your code looks optimal, in practice you get code that is less optimal than if you had opted for a simpler approach:

std::atomic<int> cnt;
int* p;
void optimal_in_theory() {
  if (cnt.fetch_sub(1, std::memory_order_release) == 1) {
    cnt.load(std::memory_order_acquire);
    delete p;
  }
}
void optimal_in_practice_on_x86_64() {
  if (cnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
    delete p;
  }
}

Assembly:

optimal_in_theory():
  lock sub DWORD PTR cnt[rip], 1
  je .L4
  rep ret
.L4:
  mov eax, DWORD PTR cnt[rip]  ;Unnecessary extra load
  mov rdi, QWORD PTR p[rip]
  mov esi, 4
  jmp operator delete(void*, unsigned long)
optimal_in_practice_on_x86_64():
  lock sub DWORD PTR cnt[rip], 1
  je .L7
  rep ret
.L7:
  mov rdi, QWORD PTR p[rip]
  mov esi, 4
  jmp operator delete(void*, unsigned long)

One day I will live in Theory, because in Theory everything goes well – Pierre Desproges


Why does the compiler keep this extra load?

According to the standard, optimizers are allowed to elide redundant loads performed on non-volatile atomics. For example, if you added three extra loads to your code:

cnt.load(std::memory_order_acquire);
cnt.load(std::memory_order_acquire);
cnt.load(std::memory_order_acquire);

With GCC or Clang, all three loads would appear in the assembly:

mov eax, DWORD PTR cnt[rip]
mov eax, DWORD PTR cnt[rip]
mov eax, DWORD PTR cnt[rip]

This is a really bad pessimization. My opinion is that it is kept as-is because of a historical confusion between "volatility" and "atomicity". While almost all programmers know that a volatile does not have the properties of an atomic variable, much code is still written with the idea that an atomic has the property of a volatile: "an atomic access is an observable behaviour". According to the standard it is not (there is an explicit example about this fact in a note in the standard). This is a recurring question on SO.

So your code really is the optimal code in theory, and it is pessimized because compilers optimize code as if atomics were also volatiles.
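To see the contrast, compare a plain variable with an atomic one. This is an illustrative snippet of my own (the names are invented, not from the question's code); the behavior described in the comments is what the assembly above shows for the atomic case:

```cpp
#include <atomic>

int plain;
std::atomic<int> atomic_cnt;

int read_plain_twice() {
    // Compilers routinely fold these two reads into a single load.
    return plain + plain;
}

int read_atomic_twice() {
    // Here compilers emit two separate loads, treating each atomic access
    // as if it were observable like a volatile access, even though the
    // standard would allow folding them as well.
    int a = atomic_cnt.load(std::memory_order_relaxed);
    int b = atomic_cnt.load(std::memory_order_relaxed);
    return a + b;
}
```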

The workaround could be to replace the load with an atomic_thread_fence, as proposed by keith in his comment. I am not a hardware expert, but I imagine that such a fence could cause more memory "synchronization" than necessary (at least in theory ;)).
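A sketch of that fence-based variant might look like the following. The `Payload` type and the function name are mine, invented for illustration (the destructor flips a flag only so the effect is observable); the ordering argument relies on the fence rules of [atomics.fences]:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical payload whose destructor flips a flag, so we can observe
// that it ran exactly once (type and names invented for illustration).
struct Payload {
    std::atomic<bool>& destroyed;
    explicit Payload(std::atomic<bool>& d) : destroyed(d) {}
    ~Payload() { destroyed.store(true, std::memory_order_relaxed); }
};

std::atomic<int> cnt;
Payload* p;

// Fence-based variant: the standalone acquire fence synchronizes with the
// release fetch_sub performed by every other thread ([atomics.fences]),
// so `delete p` still happens after all the other decrements, and no
// extra load instruction is emitted.
void release_ref() {
    if (cnt.fetch_sub(1, std::memory_order_release) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete p;
    }
}
```

On x86 an acquire fence compiles to no instruction at all, so this variant avoids the extra load while remaining portable.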

Why do I believe your code is optimal in theory?

The last shared_ptr to a given object must call the destructor of that object without itself causing a data race. The destructor may access the value of the object, so the destructor call must happen after the "invalidation" of the other pointers to the object.

So delete p; must "happen after" the destructor calls of all the other shared pointers that were sharing the same pointed-to object.

In the standard happens before is defined by the following paragraph:

[intro.races]/9:

An evaluation A inter-thread happens before an evaluation B if:

  • A synchronizes with B, or [...]
  • any combination of these with "sequenced before" (the relation is transitive). [...]

[intro.races]/10:

An evaluation A happens before an evaluation B (or, equivalently, B happens after A) if:

  • A is sequenced before B, or

  • A inter-thread happens before B.

So there must be a "synchronizes with" relation between the fetch_sub that is sequenced before delete p and the other fetch_subs.

According to [atomics.order]/2:

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

So delete p must be sequenced after an acquire operation that loads a value belonging to the release sequence headed by each of the other fetch_subs.

According to [intro.races]/5, the last fetch_sub (in the modification order of cnt) belongs to the release sequence headed by every other release fetch_sub, because a fetch_sub is a read-modify-write operation, as is fetch_add (supposing no other operations happen on cnt).

So delete p will happen after all the other fetch_subs, and it is only right before delete p is called that any "synchronization" is produced. Precisely no more than what is necessary.
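Putting the pieces together, including Arne Vogel's point from the comments that the increment on copy can be relaxed, a minimal intrusive ref-counted pointer with exactly this ordering might be sketched as follows. The class and member names are invented for illustration; this is not the questioner's full class (assignment operators and weak-pointer support are omitted):

```cpp
#include <atomic>
#include <utility>

// Illustrative intrusive ref-count base: the pointee embeds its own counter.
struct RefCounted {
    std::atomic<int> cnt{1};  // the creating pointer holds the first reference
    virtual ~RefCounted() = default;
};

template <class T>
class intrusive_ptr {
    T* p = nullptr;
public:
    intrusive_ptr() = default;
    explicit intrusive_ptr(T* q) : p(q) {}  // takes ownership; cnt already 1
    intrusive_ptr(const intrusive_ptr& o) : p(o.p) {
        // Copying only needs atomicity, not ordering: the new reference is
        // published through the pointer itself, so relaxed suffices.
        if (p) p->cnt.fetch_add(1, std::memory_order_relaxed);
    }
    intrusive_ptr(intrusive_ptr&& o) noexcept : p(std::exchange(o.p, nullptr)) {}
    ~intrusive_ptr() {
        if (p && p->cnt.fetch_sub(1, std::memory_order_release) == 1) {
            // The acquire load synchronizes with every other release
            // fetch_sub via the release sequence, as argued above.
            p->cnt.load(std::memory_order_acquire);
            delete p;
        }
    }
    T* get() const { return p; }
    // assignment operators omitted for brevity
};
```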

Oliv
  • Well, I try to write portable code, so I will stand on the theory side ;) – Lingxi Mar 06 '18 at 01:55
  • Will `(void)p->cnt.load(std::memory_order_acquire);` make things better, possibly remove the unnecessary load? – Lingxi Mar 06 '18 at 01:57
  • @Lingxi `(void)x;` and `x;` are both discarded-value expressions; this should not change the generated code. – Oliv Mar 06 '18 at 08:22
  • @xskxzr I did it. – Oliv Mar 06 '18 at 09:58
  • @Lingxi I made some additions. If you are "new", I cannot imagine what you will be when you no longer consider yourself "new"! – Oliv Mar 06 '18 at 09:59
  • 1
    Regarding fences, it looks like using `std::atomic_thread_fence(std::memory_order_acquire);` in place of the load is apparently perfect. It is a no-op on x86, and on arm64 it is `dmb ishld` which sounds like exactly enough (the fetch_sub is also "better" with the non-aquiring memory order: using `ldxr` not `ldaxr`). https://godbolt.org/z/EnWM8a –  Aug 04 '20 at 05:36