C++: Which weak atomic to use for buffers that receive async. RDMA transfers?

Question

The Derecho system (open-source C++ library for data replication, distributed coordination, Paxos -- ultra-fast) is built around asynchronous RDMA networking primitives. Senders can write to receivers without pausing, using RDMA transfers into receiver memory. Typically this is done in two steps: we transfer the data bytes in one operation, then notify the receiver by incrementing a counter or setting a flag: "message 67 is ready for you, now". Soon the receiver will notice that message 67 is ready, at which point it will access the bytes of that message.

Intended semantic: "seeing the counter updated should imply that the receiver's C++ code will see the bytes of the message." In PL terms, we need a memory fence between the update of the guard and the bytes of the message. The individual cache lines must also be sequentially consistent: my guard will go through values like 67, 68, ... and I don't want any form of mashed-up value or non-monotonic sequencing, such as could arise if C++ reads a stale cache line or mistakenly holds a stale value in a register. The same goes for the message buffer itself: these bytes might overwrite old bytes, and I don't want to see a mix of old and new.

This is the crux of my question: I need a weak atomic that will impose [exactly] the needed barrier, without introducing unwanted overheads. Which annotation would be appropriate? Would the weak atomic annotation be the same for the "message" as for the counter (the "guard")?

Secondary question: If I declare my buffer with the proper weak atomic, do I also need to say that it is "volatile", or will C++ realize this because the memory was declared weakly atomic?

At first glance, it seems to me that an acquire load of the guard is sufficient. That prevents any loads of the data from becoming visible before the update of the guard. After that, it seems like it's safe to read the data itself as ordinary loads, since they don't need to happen in any particular order; no `volatile` or `atomic` needed here. Presumably there is a store at the other end to signal the sender that it is okay to overwrite the buffer, and that store should be release. — Nate Eldredge, Sep 11 '21 at 15:45
If the buffer needs to be non-cacheable, so that the reads hit RAM instead of your cache, that's between you and your OS or hardware; C++ provides no way to control that. — Nate Eldredge, Sep 11 '21 at 15:47
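The handshake described in the first comment above (acquire load of the guard, ordinary reads of the data, release store to hand the buffer back) can be sketched roughly as follows. This is only an illustration, not Derecho code: `Channel`, `ready`, `consumed`, `sender`, `receiver` and `run_handshake` are made-up names, and an ordinary thread stands in for the remote sender, whereas real RDMA writes come from the NIC.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical single-slot channel that is reused for every message.
struct Channel {
    std::uint64_t data = 0;                  // plain payload, overwritten per message
    std::atomic<std::uint64_t> ready{0};     // guard: last sequence number written
    std::atomic<std::uint64_t> consumed{0};  // ack: last sequence number read
};

// Stand-in for the remote sender (assumption: a thread, not a NIC).
void sender(Channel& ch, std::uint64_t count) {
    for (std::uint64_t seq = 1; seq <= count; ++seq) {
        // Wait until the receiver has finished with the previous message.
        while (ch.consumed.load(std::memory_order_acquire) != seq - 1) {
            std::this_thread::yield();
        }
        ch.data = seq * 10;                              // ordinary store
        ch.ready.store(seq, std::memory_order_release);  // publish the message
    }
}

std::vector<std::uint64_t> receiver(Channel& ch, std::uint64_t count) {
    std::vector<std::uint64_t> seen;
    for (std::uint64_t seq = 1; seq <= count; ++seq) {
        // Acquire load: the read of ch.data below cannot move above it.
        while (ch.ready.load(std::memory_order_acquire) != seq) {
            std::this_thread::yield();
        }
        seen.push_back(ch.data);                            // ordinary load
        ch.consumed.store(seq, std::memory_order_release);  // hand the buffer back
    }
    return seen;
}

// Runs both sides and returns the payloads in the order they were received.
std::vector<std::uint64_t> run_handshake(std::uint64_t count) {
    Channel ch;
    std::thread t(sender, std::ref(ch), count);
    std::vector<std::uint64_t> seen = receiver(ch, count);
    t.join();
    return seen;
}
```

The release store to `consumed` is what keeps the receiver's read of `data` from being overtaken by the sender's next overwrite. Whether an RDMA device honours the same ordering is a hardware question that the C++ memory model does not answer.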

G. Sliepen · Accepted Answer · 2021-09-11T15:51:24.043


An atomic counter, whatever its type, will not guarantee anything about memory not controlled by the CPU. Before the RDMA transfer starts, you need to ensure the CPU's caches for the RDMA region are flushed and invalidated, and then of course not read from or write to that region while the RDMA transfer is ongoing. When the RDMA device signals the transfer is done, then you can update the counter.

In the thread that is waiting for the counter to be incremented, loads and stores that come after the read of the counter must not be reordered before that read, so the correct memory order for the read is std::memory_order_acquire. So basically, you want release-acquire ordering, although there is nothing to "release" in the thread that updates the counter.
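A minimal sketch of that acquire read on the receiver side, under some loud assumptions: `Slot`, `simulated_sender`, `receive` and `run_demo` are hypothetical names, and a second thread plays the role of the writer so that the example runs without RDMA hardware. With a real device the payload is written from outside the program, so the release store below exists only in the simulation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>
#include <thread>

// Hypothetical receive slot: payload bytes plus a guard counter.
struct Slot {
    char payload[64];                     // plain bytes: neither atomic nor volatile
    std::atomic<std::uint64_t> guard{0};  // sequence number of the latest complete message
};

// If a device is to write the guard directly, it has to be a plain machine
// word rather than a lock-protected object.
static_assert(std::atomic<std::uint64_t>::is_always_lock_free,
              "guard must be lock-free");

// Stand-in for the remote writer: data first, then the guard.
void simulated_sender(Slot& slot, std::uint64_t seq, const char* text) {
    std::strncpy(slot.payload, text, sizeof slot.payload - 1);
    slot.payload[sizeof slot.payload - 1] = '\0';
    slot.guard.store(seq, std::memory_order_release);
}

// Receiver: spin on an acquire load of the guard, then read the payload
// with ordinary loads.
std::string receive(Slot& slot, std::uint64_t expected_seq) {
    while (slot.guard.load(std::memory_order_acquire) < expected_seq) {
        std::this_thread::yield();
    }
    return std::string(slot.payload);
}

// Sends one message through the slot and returns what the receiver saw.
std::string run_demo() {
    Slot slot;
    std::thread writer(simulated_sender, std::ref(slot), 67, "message 67");
    std::string got = receive(slot, 67);
    writer.join();
    return got;
}
```

Only the guard is a `std::atomic`; the payload is read with ordinary loads once the acquire load has observed the expected sequence number.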

You don't need to make the buffers volatile; in general you should not rely on volatile for atomicity.

edited Sep 11 '21 at 15:51

answered Sep 11 '21 at 15:42


As a remark, though, we certainly can read the region while the RDMA transfer is underway: the memory subsystem will still be cache-line sequential which is enough to ensure a form of safety. For example, our guard variable will always yield a value that was written at some legitimate point in time and will advance monotonically because the writer advances it monotonically. – Ken Birman Sep 11 '21 at 18:31
I don't know the details of the RDMA devices you are using, so you know best how they work. However, in general you should be cautious about making assumptions; consider, for example, that if the RDMA region is marked as cacheable, doing sequential reads might cause the prefetcher in your CPU to start reading ahead of the actual load instructions. – G. Sliepen Sep 11 '21 at 19:49
Agreed. This was our reasoning when we tagged that memory region as volatile, but with the new weak atomics and the deprecated view of volatile, we realized that our old approach would eventually cease to work correctly! My students are studying the exact semantics of the release-acquire memory property, but you have me convinced! – Ken Birman Sep 12 '21 at 20:41
