Parallel counting using a functional approach and immutable data structures?

Question

I have heard and bought the argument that mutation and state are bad for concurrency, but I struggle to understand what the correct alternatives actually are.

Take the simplest of all tasks, counting: for example, counting words in a large corpus of documents. Accessing and parsing the documents takes a while, so we want to do it in parallel using k threads, actors, or whatever the abstraction for parallelism is.

What would be a correct but also practical purely functional way to do this, using immutable data structures?

score 2 · Answer 1 · answered Jun 27 '18 at 16:20

The general approach to analyzing data sets in a functional way is to partition the data set in some way that makes sense. A document might be cut into sections based on size: with four threads, the document is split into four pieces.
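A minimal sketch of that partitioning step, in Python (my choice of language; the answer does not name one). The helper name `partition` is hypothetical, and it sizes sections by word count rather than by bytes so that no word is cut in half. It returns a tuple, so the sections cannot be modified after they are handed out.

```python
from typing import Tuple

def partition(text: str, k: int) -> Tuple[str, ...]:
    """Split text into at most k sections of roughly equal word count."""
    if k < 1:
        raise ValueError("k must be at least 1")
    words = text.split()
    size = max(1, -(-len(words) // k))  # ceiling division, never zero
    return tuple(
        " ".join(words[i:i + size])
        for i in range(0, len(words), size)
    )

# Ten words for four workers: three sections of three words and one of one.
sections = partition("a b c d e f g h i j", 4)
```

Each section can then be given to a separate thread or process without any coordination, because nothing is shared except read-only input.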

Each thread or process then executes its algorithm on its section of the data set and generates an output. All the outputs are gathered together and then merged. For word counts, for example, each collection of word counts is sorted by word, and the lists are then stepped through together, looking for the same words. If a word occurs in more than one list, its counts are summed. In the end, a new list with the summed counts of all the words is output.
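The merge described above could look roughly like this. It is a sketch under my own assumptions: each partial result is a tuple of `(word, count)` pairs sorted by word, and the names `merge` and `merge_all` are made up for illustration. The scratch list inside `merge` is mutated, but only locally; callers only ever see immutable tuples.

```python
from functools import reduce
from typing import Tuple

Counts = Tuple[Tuple[str, int], ...]  # sorted by word, immutable

def merge(left: Counts, right: Counts) -> Counts:
    """Merge two word-sorted count lists, summing counts of shared words."""
    out = []  # local scratch list, frozen into a tuple before returning
    i = j = 0
    while i < len(left) and j < len(right):
        (lw, lc), (rw, rc) = left[i], right[j]
        if lw == rw:
            out.append((lw, lc + rc))
            i += 1
            j += 1
        elif lw < rw:
            out.append((lw, lc))
            i += 1
        else:
            out.append((rw, rc))
            j += 1
    out.extend(left[i:])   # at most one of these two tails is non-empty
    out.extend(right[j:])
    return tuple(out)

def merge_all(lists: Tuple[Counts, ...]) -> Counts:
    """Fold any number of partial results into one."""
    return reduce(merge, lists, ())

a = (("apple", 2), ("cat", 1))
b = (("apple", 1), ("bee", 4))
c = (("cat", 3),)
total = merge_all((a, b, c))
```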

This approach is commonly referred to as map/reduce. The step of converting a document into word counts is a "map" and the aggregation of the outputs is a "reduce".
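Put together, a word count in this style might be sketched as follows. This is an illustration, not a definitive implementation: the function names are hypothetical, `MappingProxyType` is used here as a read-only wrapper to stand in for an immutable map, and threads are used only to keep the example short (on standard CPython they mainly help when reading the documents is I/O-bound).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from types import MappingProxyType
from typing import Mapping, Sequence

def count_words(section: str) -> Mapping[str, int]:
    """Map step: one section in, one read-only count mapping out."""
    return MappingProxyType(Counter(section.lower().split()))

def combine(a: Mapping[str, int], b: Mapping[str, int]) -> Mapping[str, int]:
    """Reduce step: builds a new mapping; neither input is modified."""
    return MappingProxyType(Counter(a) + Counter(b))

def word_count(sections: Sequence[str], workers: int = 4) -> Mapping[str, int]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, sections))  # "map"
    return reduce(combine, partials, MappingProxyType({}))  # "reduce"

result = word_count(("the cat sat", "the dog sat", "the end"))
```

No worker ever sees another worker's data, so there is nothing to lock; the only point where results meet is the reduce, which runs after the workers have finished.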

Besides eliminating the overhead of preventing data conflicts, a functional approach lets the compiler optimize more aggressively. Not all languages and compilers do this, but because the compiler knows that its variables will not be modified by an outside agent, it can apply transformations to the code that improve its performance.

In addition, functional programming lets systems like Spark dynamically create threads because the boundaries of change are clearly defined. That's why you can write a single function chain in Spark and then just throw servers at it without having to change the code. Pure functional languages can do this in a general way, making every application intrinsically multi-threaded.

One of the reasons functional programming is "hot" is because of this ability to enable multiprocessing transparently and safely.

OK, so during the "map" phase, each process just processes its own portion of the data, using its own separate data structure. For word counting this could be a hash map, and since the process is independent of the others, the hash map would not actually have to be immutable, since only one thread ever updates it, right? But this approach also requires that there be a way to rephrase the operation in a map-reduce fashion. While this can be done in an obvious way for word counting, I am sure there are other tasks where it is difficult or provably impossible? – jpp1, Jul 12 '18 at 13:56
Yes. Not all problems lend themselves to map-reduce, but it's fun to discover clever ways to do it. For example, computing a hash cannot be done in parallel because each step depends on the previous one, but the data can be broken up into blocks and the block hashes recorded in a directory, which is then hashed. This speeds up creating the hash at the cost of later comparisons. – Grant BlahaErath, Jul 14 '18 at 00:01
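
The block-hashing idea from the comment above can be sketched like this. SHA-256, the block size, and the function names are my assumptions, not something the comment specifies. Note that the result is not the plain hash of the data and changes with the block size, so both sides of a comparison have to use the same scheme.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

def split_blocks(data: bytes, block_size: int) -> Tuple[bytes, ...]:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return tuple(data[i:i + block_size] for i in range(0, len(data), block_size))

def block_digest(block: bytes) -> bytes:
    """Hash one block; blocks are independent, so this can run in parallel."""
    return hashlib.sha256(block).digest()

def hash_of_hashes(data: bytes, block_size: int = 1 << 20, workers: int = 4) -> str:
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        directory = tuple(pool.map(block_digest, blocks))  # ordered block hashes
    # Hash the "directory" of block hashes to get a single value.
    return hashlib.sha256(b"".join(directory)).hexdigest()

data = b"abcdefgh" * 1000
digest = hash_of_hashes(data, block_size=1024)
```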

score 1 · Answer 2 · answered Jun 25 '18 at 22:52

Mutation and state are bad for concurrency only if mutable state is shared between multiple threads for communication, because it's very hard to reason about impure functions and methods that silently trash some shared memory in parallel.

One possible alternative is using message passing for communication between threads/actors (as is done in Akka), and building ("reasonably pure") functional data analysis frameworks like Apache Spark on top of it. Apache Spark is known to be rather suitable for counting words in a large corpus of documents.
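
To illustrate the message-passing idea without pulling in Akka or Spark, here is a small sketch that uses only Python's standard library; it is not how either framework is implemented. The names `worker`, `count_with_messages`, and `STOP` are hypothetical. Workers never touch shared state: they receive documents as messages and reply with immutable tuples.

```python
import queue
import threading
from collections import Counter

STOP = None  # sentinel message telling a worker to finish

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Receive documents, reply with immutable (word, count) tuples."""
    while True:
        doc = inbox.get()
        if doc is STOP:
            break
        outbox.put(tuple(Counter(doc.split()).items()))

def count_with_messages(docs, workers: int = 2) -> dict:
    docs = tuple(docs)
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for doc in docs:
        inbox.put(doc)
    for _ in threads:
        inbox.put(STOP)  # one stop message per worker
    for t in threads:
        t.join()
    total = Counter()  # owned by the collecting thread only
    for _ in docs:     # exactly one reply per document
        total.update(dict(outbox.get()))
    return dict(total)

counts = count_with_messages(("a b a", "b c"))
```

The only mutable object, `total`, lives in a single thread; everything that crosses a thread boundary is an immutable message.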

Andrey Tyukin


Thank you! I understand that the problem of mutable data structures comes from sharing them, but I am also interested in the more theoretical question of what a purely functional approach could actually look like? If all I have is functions and vals, and I want to implement the "parallel word count" example that way, how would I have to do it? – jpp1 Jun 26 '18 at 16:38
