
I want to implement a lock-free counter (a 4-byte int) in System V shared memory. The writer is a C++ program, the reader is a Python program. It should work roughly like this:

  • The C++ code updates the counter in an atomic operation
  • The Python code reads the counter and gets a consistent view of memory (eventual consistency is perfectly acceptable)
  • No locks are used to achieve this

Within the C++ language there are atomic get/update operations that allow for this and guarantee memory consistency; I believe the same is true in Python.

However, as I understand it, the guarantees C++ makes about atomic operations do not necessarily apply to code written and compiled in another language with a different compiler.

Is there a way to achieve a consistent view of shared memory across languages that doesn't involve implementing low level locks?

David Parks
  • I think my answer might not be the best, but thanks to Cunningham's Law you should now receive the most information that SO wants to provide. – Superlokkus Nov 05 '21 at 17:06

3 Answers

4

Yes, using the `atomics` library, along with a suitable shared memory library (e.g. `mmap` or `multiprocessing.shared_memory`).
This example assumes your atomic int is in the first 4 bytes of the shared memory segment.

from atomics import atomicview, MemoryOrder, INT
from multiprocessing import shared_memory


# connect to existing shared memory segment
shmem = shared_memory.SharedMemory(name="test_shmem")

# get buf corresponding to "atomic" region
buf = shmem.buf[:4]

# atomically read from buffer
with atomicview(buffer=buf, atype=INT) as a:
    value = a.load(order=MemoryOrder.ACQUIRE)

# print our value
print(value)

# del our buf object (or shmem.close() will complain)
del buf

# close our shared memory handle
shmem.close()

We can use the ACQUIRE memory order here rather than the default SEQ_CST; it pairs with a RELEASE (or stronger) store on the C++ writer side.

An atomicview can only be created and used within a `with` statement, so you will need to keep your `buf` slice around yourself and manage its lifetime correctly (in particular, `del` it before calling `shmem.close()`).

Note: I am the author of this library

doodspav
  • This is very cool. Can you comment on how this would interact with a C++ application writing the int using standard C++ atomic operations? – David Parks Nov 10 '21 at 18:46
  • On the C++ end you would place a `volatile std::atomic` object into that buffer and use `.store(val, std::memory_order_release)` (or a stronger memory order). It needs to be `volatile` since the compiler *is* allowed to optimise out atomic operations. (If you want, I can write up a small C++/Python example, using `boost::interprocess` for the C++ shared memory access). – doodspav Nov 10 '21 at 19:14
  • To be clear, `std::atomic::load()` in C++ and `AtomicIntView.load()` in Python (where `width=4`) should have exactly the same side effects in memory (the Python function calls `atomic_load` from `<stdatomic.h>` in C). The only difference is the Python function is guaranteed to be lock-free (in C++ you have to check). – doodspav Nov 10 '21 at 19:17
  • Thanks, that's what I was looking for. No need for an example, I think that squarely falls into my realm of responsibility. This was very helpful! – David Parks Nov 10 '21 at 19:52
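
For reference, here is a minimal sketch of the C++ writer side described in the comments above. This is not doodspav's code: it assumes a POSIX shared memory object named "/test_shmem" (which is what Python's `SharedMemory(name="test_shmem")` opens on Linux), uses plain `shm_open`/`mmap` instead of the `boost::interprocess` wrapper mentioned in the comment, requires C++17, and omits error handling.

#include <atomic>
#include <new>          // placement new
#include <fcntl.h>      // O_CREAT, O_RDWR
#include <sys/mman.h>   // shm_open, mmap, munmap
#include <unistd.h>     // ftruncate, sleep, close

int main() {
    // Create the shared memory object; Python's SharedMemory(name="test_shmem")
    // attaches to the same object.
    int fd = shm_open("/test_shmem", O_CREAT | O_RDWR, 0666);
    ftruncate(fd, sizeof(std::atomic<int>));

    void* addr = mmap(nullptr, sizeof(std::atomic<int>),
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // Construct the counter in the first 4 bytes of the segment, matching the
    // layout the Python reader expects, and confirm it is a lock-free 4-byte int.
    static_assert(std::atomic<int>::is_always_lock_free);
    static_assert(sizeof(std::atomic<int>) == sizeof(int));
    volatile std::atomic<int>* counter = new (addr) std::atomic<int>(0);

    for (int i = 1; i <= 60; ++i) {
        // The RELEASE store pairs with the ACQUIRE load in the Python reader above.
        counter->store(i, std::memory_order_release);
        sleep(1);
    }

    munmap(addr, sizeof(std::atomic<int>));
    close(fd);
}
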
1

Is there a way to achieve a consistent view of shared memory across languages that doesn't involve implementing low level locks?

No, not in general.

First, I would say this has less to do with languages and more to do with the actual platform, architecture, implementation, and operating system.

Languages differ quite strongly here. Take Python, for example: it has no language-native way of accessing memory directly or, let's say, in a low-level manner, although some of its implementations do offer their own APIs for that. Languages intended for such low-level use, like C, C++, or Rust, have abstractions for it. But those abstractions are often implemented quite differently, since they depend on what the code is run on, interpreted by, or compiled for. Integers are big endian on some architectures, and little endian on most, like x86 or ARM. Operating systems also have a say, for example in how memory is used and abstracted.

And while many languages share a common abstraction of linear memory, it gets even messier with atomics: a C++ compiler could generate machine code (i.e. assembly) that checks whether the CPU it runs on supports fancy new atomic integer instructions and uses them, or else falls back on the widely supported atomic flag plus a plain integer. It could also just rely on the operating system, a spin lock, or standardized APIs defined by POSIX, System V, Linux, or Windows, if the code has the luxury of running in an operating-system-managed environment.

For non-imperative languages it gets even messier.

So in order to exchange data between languages, the implementations of those languages have to agree on some common form of exchange. They could try to do that directly via shared memory, for instance; such an agreement is then called an Application Binary Interface (ABI), since a memory abstraction is at least common to them a priori. Or the operating system or architecture might even standardize such things or provide APIs for them.

System V would be an API designed for such interchange, but since AFAIK it has no abstraction for atomics or other lock-free facilities, the answer stays no, even with the System V context from the title.

Superlokkus
  • Why would endianness be relevant? We're not talking about heterogeneous shared memory with a MIPS in big-endian mode sharing a cache-coherent view of memory with an ARM or another MIPS in little-endian mode. And re: OS, the question already specified using System V shared memory, which implies each process has the same physical page mapped into its virtual address space. (With coherent cache.) – Peter Cordes Nov 06 '21 at 01:59
  • But yes, having shared memory is not enough, you do need a portable API in whatever language to do a 4-byte write or read that can't optimize away. All modern mainstream 32 and 64-bit systems do guarantee atomicity for aligned 4-byte loads/stores when done with a single instruction, though. (e.g. for x86, [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/a/36685056) shows why its atomic in asm, but why you still need `std::atomic` with mo_relaxed in C++, although that is safe to use in shared memory) – Peter Cordes Nov 06 '21 at 02:11
1

I'm going to take issue with some of Superlokkus' assertions.

The mmap primitive is available in both C++ and Python. That primitive gives you memory that is in a physical page shared by both processes. Different virtual addresses, same physical page of memory. Changes by one process are immediately viewable by the other process. That's how it has to work. It operates at a hardware level. Abstractions are irrelevant.

Now, that DOESN'T mean you can get notification of those changes. If you are polling in a loop (presumably a friendly loop with sleeping in between checks), then you will see the change the next time you check.
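
As an illustration of this approach (not Tim's code), here is a sketch of what the writer half could look like in C++ (C++20, for `std::atomic_ref`). It assumes a hypothetical backing file at `/dev/shm/counter` that the Python reader would open and `mmap` as well; the counter is a plain, naturally aligned int in the shared page, and `std::atomic_ref` just ensures the store is a single, untorn 4-byte write that the compiler cannot optimize away.

#include <atomic>
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <unistd.h>     // ftruncate, sleep, close

int main() {
    // Back the mapping with a small file; the Python reader mmaps the same file.
    int fd = open("/dev/shm/counter", O_CREAT | O_RDWR, 0666);
    ftruncate(fd, sizeof(int));

    void* page = mmap(nullptr, sizeof(int), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);       // same physical page in both processes
    int* counter = static_cast<int*>(page);     // page-aligned, hence naturally aligned

    for (int i = 1; i <= 60; ++i) {
        // A relaxed atomic store is enough for an eventually consistent counter.
        std::atomic_ref<int>(*counter).store(i, std::memory_order_relaxed);
        sleep(1);                               // the reader polls at its own pace
    }

    munmap(page, sizeof(int));
    close(fd);
}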

Tim Roberts
  • Does the fact that 4 bytes need to be written in an atomic manner affect this? Does System V Shared Memory guarantee that the C++ code that performs an atomic 4-byte int update is seen by Python consistently, or is there the possibility that Python will read a partially written `counter` in this case? I can see how the memory page will be consistent between processes (e.g. no CPU cache buffer issues at play), but the risk of a race condition on a partial write of `counter` still feels like a thorny issue to me that gives me pause to think. We can assume a friendly polling reader here. – David Parks Nov 05 '21 at 18:39
  • All modern PC systems do aligned 32-bit reads and writes atomically. `mmap` will deliver a page-aligned address, so as long as you do not intentionally write to byte 3, atomicity is guaranteed. – Tim Roberts Nov 05 '21 at 18:50
  • Oh, that's something I didn't know, thank you! Does the same apply to 64-bit reads/writes, assuming a 64-bit architecture? I assume 64-bit data types on a 32-bit architecture are not atomic, and the same goes for 128-bit data types on a 64-bit architecture. – David Parks Nov 05 '21 at 18:55
  • DDR memory is 64 bits wide, but on a 32-bit processor the compiler has to use two instructions to write a 64-bit value, because registers are only 32 bits wide. There's a very narrow danger zone between those two instructions. – Tim Roberts Nov 05 '21 at 19:00
  • As I immediately wrote after posting my answer: let Cunningham's Law work ;-) You are right, I overlooked that there has been a shared memory abstraction there since Python 3. But depending on how strictly you interpret the question, and as you partly said yourself: it is not consistent. Although 4-byte integers COULD be OK on many systems, I would say they are not when volatile isn't used, thanks to caching. I wouldn't bet anything important on it. – Superlokkus Nov 05 '21 at 19:10
  • All modern systems have cache coherency between processors. What you're talking about was a 20th Century problem, and there are some tiny processors where this is an issue, but not anything mainstream. – Tim Roberts Nov 05 '21 at 19:22
  • C++ can definitely write to the buffer atomically. The hardware can do an atomic read (of a word-sized, word-aligned value, which is a decent assumption if it was written by C++). The only question is how Python handles this. Is program-order instruction reordering possible in Python? Also, if you're relying on the counter as a condition instead of just informational output, that can be a problem. E.g. if you assume some condition is met once you get to count 20, that is a problem, because Python would have to issue a load-acquire CPU fence instruction on weakly ordered architectures to preserve causal ordering. – Humphrey Winnebago Nov 05 '21 at 23:07
  • You're adding distractions that just aren't relevant here. If you were trying to get single-cycle latency, then you'd worry about this, but running a single Python statement involves hundreds or thousands of cycles. Reordering is not relevant. Fences are not relevant. When the value changes, Python will see the change. – Tim Roberts Nov 06 '21 at 00:48
  • "When the value changes, Python will see the change." Since you're so sure of this, it should be easy for you to provide proof/documentation? Such a guarantee would be enough for the OP's goal of "eventual consistency". I understand that you're saying this is the way it *should* work in order for things to avoid going to hell in a handbasket, but consider me a doubting thomas who wants to see the real deal. I hope you understand that I'm coming from a C++ perspective where making assumptions can crash the plane. – Humphrey Winnebago Nov 06 '21 at 07:00
  • You're mixing two concepts here. In C/C++, you would need to use `volatile` to make sure the compiler doesn't optimize away the fetch. Python simply can't optimize to that level because it is interpreted. The code is just not low-level enough to make that possible or practical. At the hardware level, which is what I was addressing, coherency is guaranteed. It's the same physical page. – Tim Roberts Nov 06 '21 at 18:38