
In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.

This approach obviously alleviates the need for complex locks because there is no shared state.

However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.

My questions:

  • Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?

  • How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?

  • Given the architecture of modern CPUs & memory, is there much difference between the two solutions? That is, can shared memory be read in parallel by multiple cores, so that there is no hardware bottleneck that would otherwise make both implementations perform roughly the same?

wsorenson

10 Answers


One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes; it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, an implementation may very well not copy the data but just send a reference to it. Or it may use a combination of both methods. As always, there is no best solution, and there are trade-offs to be made when choosing how to do it.

The BEAM uses copying, except for large binaries where it sends a reference.

rvirding
  • Yes, shared state could be faster in this case. But only if you can forgo the locks, and that is only doable if the data is absolutely read-only. If it's 'mostly read-only', then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.

  • Yes, you could write a 'server process' to share it. With really lightweight processes, it's no heavier than writing a small API to access the data. Think of it like an object (in the OOP sense) that 'owns' the data; see the sketch after this list. Splitting the data into chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).

  • Even as NUMA is becoming mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared-state/locks are becoming less and less feasible.
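As a minimal sketch of the 'server process' idea from the second point, here it is in Go (the query type and the toy data are hypothetical, just for illustration): one goroutine owns the read-only structure, and clients ask for values over a channel.

```go
package main

import "fmt"

// query asks the owning goroutine for the element at a given index.
type query struct {
	index int
	reply chan string
}

// serve owns the read-only data; no other goroutine ever touches it.
func serve(data []string, queries <-chan query) {
	for q := range queries {
		q.reply <- data[q.index]
	}
}

func main() {
	// A toy stand-in for a large read-only structure such as a suffix array.
	data := []string{"banana", "anana", "nana", "ana", "na", "a"}
	queries := make(chan query)
	go serve(data, queries)

	reply := make(chan string)
	queries <- query{index: 2, reply: reply}
	fmt.Println(<-reply) // prints "nana"
}
```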

In short: get used to message passing and server processes; it's all the rage.

Edit: revisiting this answer, I want to add a note about a phrase found in Go's documentation:

share memory by communicating, don't communicate by sharing memory.

The idea is this: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message containing the reference; a thread only accesses the memory after receiving the message. It relies on some measure of programmer discipline, but it results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.

The advantage is that you don't have to copy big blocks of data on every message, and you don't have to flush caches as on some lock implementations. It's still somewhat early to say whether the style leads to higher-performance designs or not (especially since the current Go runtime is somewhat naive about thread scheduling).
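Here is a minimal sketch of that style in Go (the Block type is hypothetical): ownership of the pointed-to memory travels with the message, and by convention only the current holder touches it.

```go
package main

import "fmt"

// Block is a chunk of memory whose owner is whoever holds the pointer.
type Block struct {
	data []int
}

func main() {
	work := make(chan *Block)
	done := make(chan *Block)

	// The worker owns a Block only between receiving it and sending it on.
	go func() {
		b := <-work
		for i := range b.data {
			b.data[i] *= 2 // safe: by convention we hold the only live reference
		}
		done <- b // ownership passes back; the worker must not touch b again
	}()

	b := &Block{data: []int{1, 2, 3}}
	work <- b // ownership passes to the worker; main must not touch b here
	b = <-done
	fmt.Println(b.data) // prints [2 4 6]
}
```

Note that no data is copied: only the pointer crosses the channel.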

Javier
  • "if it's 'mostly read-only' then you need a lock". Not true. Overwriting a mutable reference to an immutable data structure is inherently atomic, for example. – J D Feb 10 '11 at 14:28
  • @Jon Harrop: no, unfortunately it's not inherently atomic. In multi-core systems the mutable reference could be cached by different cores, so if you don't use some mechanism (like memory barriers) to ensure some partial ordering of access, you end up with inconsistent behavior. Just writing a pointer is not enough. What you need to avoid locks is a truly lock-free algorithm. The basic trick is effectively atomic pointer replacement; but it has to be done right, not rely on "it's a single instruction, so it's atomic" myths – Javier Feb 10 '11 at 19:04
  • @Javier: With the memory models of all major architectures, the reader can see only either the old or the new version of the immutable data structure. That's why these memory models were chosen and it is why the ECMA C# specification mandates the behaviour I described. – J D Feb 10 '11 at 21:18
  • @Jon Harrop: is there any guarantee that _no_ reader will see the new version before the (no-barrier) pointer replacement and _all_ of them see it afterwards? – Javier Feb 12 '11 at 04:39
  • @Javier: Yes. The pointer write goes asynchronously to main memory and the invalidation of that cache line goes asynchronously to the other caches. They continue to read the old version until the cache line is invalidated whereupon they refetch the cache line and then observe the new version. No reader can see the new version before it has been written and the invalidation will eventually affect all of the readers. The main practical application is snooping the writer without slowing it down, e.g. to visualize the results of a worker thread. – J D Feb 12 '11 at 08:50
  • Without the atomicity of the write, a reader could observe a partially written pointer (e.g. low bits of the old and high bits of the new), which would be a disaster. You couldn't even built a memory safe VM like the JVM or CLR without locking on every write! – J D Feb 12 '11 at 08:58
  • atomicity is one thing, and usually guaranteed up to a size (not all architectures go up to pointer-size, but all 'big' ones do); and no-reordering is another. it's quite common that a separate thread could see the pointer replacement at some point of time, and the last settings to the new versions content some time after that. – Javier Feb 12 '11 at 12:14
  • "it's quite common". The specifications explicitly forbid it on x86, x64, ARM and the CLI. Notable architectures that tried weaker memory models where that could happen (you would need to insert a write-write barrier) include the DEC Alpha and Intel Itanium but I would not call them common. Reordering is only commonly a problem when you wish to have multiple writes to different memory locations appear to occur in a specific order but that is because the *reads* get reordered. – J D Feb 12 '11 at 21:13
  • interesting. i haven't stopped to check if writes are guaranteed to be ordered, but disordered reads are enough to spoil any hope of simplistic schemes. write barriers are a must. that's why all lock-free algorithms (one of my hobbies) need some memory barriers (`CAS` for x86 and derivates) – Javier Feb 13 '11 at 02:26
  • *while a lock has to be flushed from cache on ALL cores* there are ways to avoid that with some "clever" techniques, namely flat combining, which seems to be gaining popularity: http://mcg.cs.tau.ac.il/papers/spaa2011-fc-numa-locks.pdf Moreover, all message passing stuff does require shared memory; it just copies portions of memory. – bestsss Jun 17 '12 at 19:55
  • @Javier "The Go style is to pass a message with the reference" where is this "reference" requirement coming from ? There is similar question I have posted about this very basics, your input would be much appreciated. https://stackoverflow.com/questions/36391421/explain-dont-communicate-by-sharing-memory-share-memory-by-communicating – honzajde Aug 27 '17 at 08:11
  • We’re [contemplating that](https://github.com/keean/zenscript/issues/41#issuecomment-406995325) shared state will be slower as we scale multi-core. – Shelby Moore III Jul 27 '18 at 08:52
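For what it's worth, the atomic pointer-replacement trick debated in the comments above can be written directly in Go: the generic Pointer type in sync/atomic (Go 1.19+) inserts the memory barriers that plain pointer writes lack. A minimal sketch, assuming the snapshot type is treated as immutable once published:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is never modified after it has been published.
type snapshot struct {
	values []int
}

var current atomic.Pointer[snapshot]

func main() {
	current.Store(&snapshot{values: []int{1, 2, 3}})

	// Writer: build a fresh immutable snapshot, then publish it atomically.
	current.Store(&snapshot{values: []int{4, 5, 6}})

	// Readers: always observe either the old or the new snapshot, never a mix.
	s := current.Load()
	fmt.Println(s.values) // prints [4 5 6]
}
```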

In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.

In Go, message passing is by convention: there's nothing but convention to prevent you from sending someone a pointer over a channel and then modifying the data it points to. So, once again, there's no need to copy the message; a sketch of the pitfall follows.
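A minimal sketch of that pitfall in Go: a slice sent over a channel shares its backing array with the sender, so a mutation after the send is visible to the receiver.

```go
package main

import "fmt"

func main() {
	ch := make(chan []byte, 1)
	msg := []byte("hello")
	ch <- msg    // no copy: the receiver gets the same backing array
	msg[0] = 'H' // nothing stops the sender from mutating after the send
	fmt.Println(string(<-ch)) // prints "Hello", not "hello"
}
```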

Nick Johnson

Most modern processors use variants of the MESI protocol. Because a cache line holding read-only data can sit in the shared state in several caches at once, passing read-only data between different threads is very cheap. Modifying shared data is very expensive though, because all other caches that store this cache line must invalidate it.

So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.

Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the parts changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
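A minimal sketch of that idea in Go, using a hypothetical immutable linked list: each new version reuses the old nodes instead of copying them.

```go
package main

import "fmt"

// List is an immutable singly linked list; nodes are never modified.
type List struct {
	head int
	tail *List
}

// Push returns a new version whose tail shares the old list unchanged.
func (l *List) Push(v int) *List {
	return &List{head: v, tail: l}
}

func (l *List) String() string {
	if l == nil {
		return "nil"
	}
	return fmt.Sprintf("%d -> %s", l.head, l.tail.String())
}

func main() {
	v1 := (*List)(nil).Push(1).Push(2) // version 1: 2 -> 1 -> nil
	v2 := v1.Push(3)                   // version 2 shares all of v1's nodes

	fmt.Println(v1) // 2 -> 1 -> nil (unchanged)
	fmt.Println(v2) // 3 -> 2 -> 1 -> nil
}
```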

Greg Rogers
  • It finally clicked for me why Go strings are immutable. Thanks! – Billy Jo Nov 26 '09 at 04:06
  • Not really - strings are immutable in Java, C# (and thus the whole .NET) and Python too, and one good reason is support for string literals (which are immutable even in C - depending on the implementation, a program might segfault when writing to a string literal). Another important reason is that if they were mutable, they couldn't be used as hashtable keys. I think this is what makes strings immutable in most modern programming languages. (See http://docs.python.org/library/stdtypes.html#typesseq-mutable to confirm this for Python.) – Blaisorblade Aug 04 '10 at 23:19
  • @Blaisorblade I would presume the main reason that strings be immutable so that they can be passed-by-reference instead of copied everywhere they’re referenced. – Shelby Moore III Jul 24 '18 at 03:42
  • We’re also [contemplating that](https://github.com/keean/zenscript/issues/41#issuecomment-406995325) shared state will be slower as we scale multi-core for the reasons you stated and more. – Shelby Moore III Jul 27 '18 at 08:55

What is a large data structure?

One person's large is another person's small.

Last week I talked to two people. One person was making embedded devices and used the word "large" - I asked him what it meant - he said over 256 KBytes. Later in the same week a guy was talking about media distribution - he used the word "large" - I asked him what he meant - he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.

In Erlang terms "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM, data structures > 100 MBytes might be considered large - copying a 500 MByte data structure might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang.

Really large data structures (i.e. ones that won't fit on one machine) have to be copied and "striped" over several machines.

So I guess you have the following:

Small data structures are no problem - since they are small, data processing times are fast, copying is fast and so on (just because they are small).

Big data structures are a problem - because they don't fit on one machine - so copying is essential.

ja.
  • Huge data structures have to be striped. However, each stripe is still big; and for any big data structure, you still want just a copy for each host (and that's what BEAM does, as discussed above). – Blaisorblade Aug 04 '10 at 23:52

Note that your questions are technically nonsensical, because message passing can use shared state, so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).

Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?

Using shared state will be a lot faster.

How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?

Either approach can be used.

Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?

Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.

Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.

J D

One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to it out to all slaves, which perform the update on their own copies.

This is especially interesting if one wants to scale to several machines that don't even have the possibility to share memory without very artificial setups (mmap of a block device that reads/writes from a remote computer's memory?).

A variant of it is to have a transaction manager that one asks nicely to update the replicated data structure, and that makes sure it serves one and only one update request concurrently. This is more like the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
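A minimal sketch of master-slave replication in Go (the update/read types and the single broadcast are hypothetical simplifications): each slave goroutine holds its own private copy and applies replicated changes in order, so readers never touch shared memory.

```go
package main

import "fmt"

// update is replicated from the master out to every slave.
type update struct {
	key, value string
}

// read asks a slave for the value of a key in its local copy.
type read struct {
	key   string
	reply chan string
}

// slave owns a private copy of the data; no locks are ever needed.
func slave(updates <-chan update, reads <-chan read) {
	data := map[string]string{}
	for {
		select {
		case u := <-updates:
			data[u.key] = u.value // apply the replicated change locally
		case r := <-reads:
			r.reply <- data[r.key]
		}
	}
}

func main() {
	const n = 3
	updateChans := make([]chan update, n)
	readChans := make([]chan read, n)
	for i := 0; i < n; i++ {
		updateChans[i] = make(chan update)
		readChans[i] = make(chan read)
		go slave(updateChans[i], readChans[i])
	}

	// The master broadcasts one change to every slave.
	for _, uc := range updateChans {
		uc <- update{key: "greeting", value: "hello"}
	}

	// Clients can now read from any slave in parallel.
	reply := make(chan string)
	readChans[1] <- read{key: "greeting", reply: reply}
	fmt.Println(<-reply) // prints "hello"
}
```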

Christian

The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).

Most of the time, a cleverly written multi-threaded algorithm that tries to eliminate most of the locking will be faster - and a lot faster with modern lock-free data structures. Especially when you have well-designed cache systems like Sun's Niagara chip-level multi-threading.

If your system/problem cannot easily be broken down into a few simple data accesses, then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium-based supercomputers sold: they have a terabyte of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.

Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared-nothing approach make them even more maintainable.

As an example, Erlang was never designed to make things faster, but instead to use a large number of threads to structure complex data and event flows.

I guess this was one of the main points in the design. In the web world of Google, you usually don't care about performance - as long as it can run in parallel in the cloud. And with message passing, you ideally can just add more computers without changing the source code.

Lothar

Usually, message-passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between processes (for local processes only, of course: you'll want to think through your network distribution pattern wisely), so this isn't much of an issue.

glenda

The other concurrency paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:

http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next

http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses

Gene T