The Derecho system (open-source C++ library for data replication, distributed coordination, Paxos -- ultra-fast) is built around asynchronous RDMA networking primitives. Senders can write to receivers without pausing, using RDMA transfers into receiver memory. Typically this is done in two steps: we transfer the data bytes in one operation, then notify the receiver by incrementing a counter or setting a flag: "message 67 is ready for you, now". Soon the receiver will notice that message 67 is ready, at which point it will access the bytes of that message.
Intended semantic: "seeing the counter updated should imply that the receiver's C++ code will see the bytes of the message." In PL terms, we need a memory fence between the update of the guard and the bytes of the message. The individual cache-lines must also be sequentially consistent: my guard will go through values like 67, 68, .... and I don't want any form of mashed up value or non-monotonic sequencing, such as could arise if C++ reads a stale cache line, or mistakenly holds a stale value in memory. Same for the message buffer itself: these bytes might overwrite old bytes and I don't want to see some kind of mashup.
This is the crux of my question: I need a weak atomic that will impose [exactly] the needed barrier, without introducing unwanted overheads. Which annotation would be appropriate? Would the weak atomic annotation be the same for the "message" as for the counter (the "guard")?
Secondary question: If I declare my buffer with the proper weak atomic, do I also need to say that it is "volatile", or will C++ realize this because the memory was declared weakly atomic?