
I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. I watched the output of htop as I did so, and CPU usage usually peaked out looking similar to the screenshot below:

[screenshot: htop output]
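For reference, the pmap from the book is essentially the following (paraphrased from memory, so treat it as a sketch rather than a verbatim copy):

defmodule Parallel do
  def pmap(collection, func) do
    me = self()

    collection
    |> Enum.map(fn elem ->
      # spawn one process per item; each sends its result back tagged with its pid
      spawn_link(fn -> send(me, {self(), func.(elem)}) end)
    end)
    |> Enum.map(fn pid ->
      # collect results in spawn order by matching on each pid
      receive do
        {^pid, result} -> result
      end
    end)
  end
end

# e.g. Parallel.pmap(1..10_000_000, &(&1 * &1))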

After showing the example in the book, Dave says:

And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.

My question is, why are only cores 1, 3, 5, and 7 lighting up on my machine? My guess would be that it has to do with my iex session being a single OS-level process, with OS X managing how far that process can reach. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?

user456584
  • What is the first line of iex's output? i.e: "Erlang/OTP 18 [erts-7.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]" – Thiago Silveira Apr 24 '16 at 01:49
  • @ThiagoSilveira `Erlang/OTP 18 [erts-7.3] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]` – user456584 Apr 24 '16 at 02:14

2 Answers


Great comment by @Thiago Silveira about the first line of iex's output. The [smp:8:8] part shows how many schedulers the Erlang VM started and how many of them are online; each scheduler runs in its own operating-system thread. You can disable SMP with the -smp flag if you want:

iex --erl '-smp disable'

This will ensure that you have only one scheduler thread. You can achieve a similar result by leaving symmetric multiprocessing enabled, but setting NumberOfSchedulers:NumberOfSchedulersOnline directly:

iex --erl '+S 1:1'

Each scheduler runs Erlang processes on its own operating-system thread, and you can easily check how many of them are currently online:

:erlang.system_info(:schedulers_online)
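On your machine this should return 8, matching the [smp:8:8] in the banner you posted.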

To answer your question about performance: if your processors are not all working at full capacity (100%), and none of them is sitting completely idle (0%), then it is likely that distributing the load more evenly will not speed things up. Why?

CPU usage is measured by sampling the processor state at many points in time. Each sample is either "working" or "idle". 82% CPU usage means that you can still run a couple more tasks on that CPU without slowing the others down.

Erlang schedulers try to be smart and avoid migrating Erlang processes between cores unless they have to, because a migrated process loses its CPU cache locality. Migration occurs, for example, when one of the schedulers is idle: it can then steal a process from another scheduler's run queue.

The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT, so it may be that you are already utilizing your physical cores at 100%.
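You can check this from the VM itself; these :erlang.system_info/1 keys are standard Erlang/OTP:

:erlang.system_info(:logical_processors)  # e.g. 8 on a 4-core machine with HT
:erlang.system_info(:cpu_topology)        # shows how logical cores map onto physical ones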

Another thing: pmap calculates each result in a separate process, but in the end every result is sent back to the caller, which may become a bottleneck. The more messages you send, the lower the CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU-intensive, like computing the Ackermann function. You can even estimate how much of your job is sequential and how much is parallel using Amdahl's law, speedup(n) = 1 / ((1 - p) + p/n) where p is the parallel fraction, by measuring execution times for different numbers of cores.
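A minimal sketch of such a CPU-bound task (the module and function names here are just illustrative):

defmodule CpuBound do
  # Naive Ackermann function: tiny inputs, huge amounts of pure computation.
  def ack(0, n), do: n + 1
  def ack(m, 0) when m > 0, do: ack(m - 1, 1)
  def ack(m, n) when m > 0 and n > 0, do: ack(m - 1, ack(m, n - 1))
end

# Keep the arguments small -- Ackermann grows explosively:
# Parallel.pmap(1..100, fn _ -> CpuBound.ack(3, 8) end)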

To sum up: the CPU utilization in your screenshot looks really good! You don't have to change anything for more performance-intensive tasks.

tkowal

Concurrency is not Parallelism

In order to get good parallel performance out of Elixir/BEAM code, you need to have some understanding of how the BEAM scheduler works.

This is a very simplistic model, but the BEAM scheduler gives each process a time slice of about 2,000 reductions before swapping it out for the next process; a reduction can be thought of as roughly one function call. By default a process runs on the core/scheduler that spawned it, and processes only get moved between schedulers if the queue of runnable processes builds up on a given scheduler. By default the BEAM runs one scheduler thread on each available core.
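You can watch reductions accumulate for a process; :erlang.process_info/2 with :reductions is standard Erlang:

:erlang.process_info(self(), :reductions)
# => {:reductions, 4542}  (an example value; the counter grows as the process does work)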

What this implies is that to get the most out of your processors, you need to break your work into pieces large enough to exceed the standard reduction time slice. In general, pmap-style parallelism only gives a significant speedup when you chunk many items into a single task.
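A minimal sketch of that chunking idea (the module name and chunk size are just illustrative):

defmodule ChunkedParallel do
  def pmap(collection, func, chunk_size \\ 10_000) do
    collection
    |> Enum.chunk_every(chunk_size)
    # one task per chunk, not per item, so each task does a meaningful slice of work
    |> Enum.map(fn chunk -> Task.async(fn -> Enum.map(chunk, func) end) end)
    |> Enum.flat_map(&Task.await(&1, :infinity))
  end
end

# e.g. ChunkedParallel.pmap(1..10_000_000, &(&1 * &1))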

The other thing to be aware of is that some parts of the BEAM use a spin/wait loop while waiting for work, and that can skew the usage you see in a tool like htop. You'll get a much better understanding of your program's performance by using :observer.
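:observer ships with Erlang/OTP; you can start it straight from iex:

:observer.start()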