
Here is the example: test_for_gen_server.erl

When a process has 10000 messages in its mailbox, it takes 0.043 sec to finish. When the number is 50000, it should take about 0.215 sec, but in reality it takes 2.4 sec, roughly 10 times slower. Why?

Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:true]

Eshell V7.1 (abort with ^G)
1> test_for_gen_server:start_link().
{ok,<0.36.0>}
2> test_for_gen_server:test(10000).
ok
======gen_server: Times:10000 Cost:42863
3> test_for_gen_server:test(10000).
ok
======gen_server: Times:10000 Cost:43096
4> test_for_gen_server:test(10000).
ok
======gen_server: Times:10000 Cost:43223
5> test_for_gen_server:test(50000).
ok
======gen_server: Times:50000 Cost:2504395
6> test_for_gen_server:test(50000).
ok
======gen_server: Times:50000 Cost:2361987
7> test_for_gen_server:test(50000).
ok
======gen_server: Times:50000 Cost:2304715

3 Answers


Following the comment: in this case it is indeed not caused by the size of the mailbox, because in gen_server every mailbox message matches; see its receive loop.

It turns out that in this case the slower execution is due to the additional complexity of the code, and in particular to multiple allocations of small amounts of data that then need to be freed by the garbage collector. So it is related not to the size of the mailbox, but to how many times the code gets executed.

Below is a slightly modified version of your code. The main difference is that the message queue is filled up after the start message has been received. In addition to your example there are eight other variations, each a slightly modified or simplified version of your loop. The second loop is based on the flow that you can find in the gen_server code.

-module (test_for_gen_server).

-behaviour (gen_server).

%% APIs
-export([test1/1, test2/1, test3/1, test4/1, test5/1, test6/1, test7/1,
         test8/1, test9/1]).

%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

test1(N) ->
    {ok, Pid} = gen_server:start_link(?MODULE, [], []),
    Pid ! {start, N}.

test2(N) -> Pid = spawn(fun() -> loop2([undefined, 0]) end), Pid ! {start, N}.
test3(N) -> Pid = spawn(fun() -> loop3([undefined, 0]) end), Pid ! {start, N}.
test4(N) -> Pid = spawn(fun() -> loop4([undefined, 0]) end), Pid ! {start, N}.
test5(N) -> Pid = spawn(fun() -> loop5([undefined, 0]) end), Pid ! {start, N}.
test6(N) -> Pid = spawn(fun() -> loop6([undefined, 0]) end), Pid ! {start, N}.
test7(N) -> Pid = spawn(fun() -> loop7([undefined, 0]) end), Pid ! {start, N}.
test8(N) -> Pid = spawn(fun() -> loop8(undefined, 0) end), Pid ! {start, N}.
test9(N) -> Pid = spawn(fun() -> loop9({undefined, 0}) end), Pid ! {start, N}.

%%==============================================================================

init([]) ->
    {ok, []}.
handle_call(_Request, _From, State) ->
    {reply, nomatch, State}.
handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({start, N}, _State) ->
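    %% Fill our own mailbox with N dummy messages plus a final stop, then
    %% record the start time in microseconds (os:timestamp() returns
    %% {MegaSecs, Secs, MicroSecs}), so the Cost figures are microseconds.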
    do_test(N),
    {A,B,C} = os:timestamp(),
    Timestamp = (A * 1000000 + B) * 1000000 + C,
    {noreply, [Timestamp, 0]};
handle_info(stop, [Timestamp, Times]) ->
    {A,B,C} = os:timestamp(),
    Timestamp1 = (A * 1000000 + B) * 1000000 + C,
    Cost = Timestamp1 - Timestamp,
    io:format("======gen_server:  Times:~p Cost:~p~n", [Times, Cost]),
    {stop, normal, []};
handle_info(_Info, [Timestamp, Times]) ->
    {noreply, [Timestamp, Times + 1]}.

terminate(_Reason, _State) -> ok.

code_change(_OldVer, State, _Extra) -> {ok, State}.

do_test(0) -> self() ! stop;
do_test(N) -> self() ! a, do_test(N - 1).

%%==============================================================================
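%% loop2: closest to the gen_server flow: receive, wrap the callback result
%% in {ok, ...}, then dispatch through a separate function.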

loop2(State) ->
    Msg = receive
              Input -> Input
          end,
    Reply = {ok, handle_info(Msg, State)},
    handle_common_reply(Reply, Msg, State).

handle_common_reply(Reply, _Msg, _State) ->
    case Reply of
        {ok, {noreply, NState}} -> loop2(NState);
        {ok, {stop, normal, _}} -> ok
    end.

%%==============================================================================
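%% loop3: like loop2, but with handle_common_reply inlined (one function call fewer).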

loop3(State) ->
    Msg = receive
              Input -> Input
          end,
    Reply = {ok, handle_info(Msg, State)},
    case Reply of
        {ok, {noreply, NState}} -> loop3(NState);
        {ok, {stop, normal, _}} -> ok
    end.

%%==============================================================================
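%% loop4: like loop3, but without the extra {ok, ...} wrapper around the callback result.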

loop4(State) ->
    Msg = receive
              Input -> Input
          end,
    case handle_info(Msg, State) of
        {noreply, NState} -> loop4(NState);
        {stop, normal, _} -> ok
    end.

%%==============================================================================
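%% loop5: like loop4, but the result is handled directly inside the receive clause.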

loop5(State) ->
    receive
        Input ->
            case handle_info(Input, State) of
                {noreply, NState} -> loop5(NState);
                {stop, normal, _} -> ok
            end
    end.

%%==============================================================================
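%% loop6: matches each message type with its own receive pattern, still calling handle_info.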

loop6(State) ->
    receive
        {start, _N} = Msg ->
            {noreply, NState} = handle_info(Msg, State),
            loop6(NState);
        stop = Msg ->
            {stop, normal, []} = handle_info(Msg, State);
        Info ->
            {noreply, NState} = handle_info(Info, State),
            loop6(NState)
    end.

%%==============================================================================
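%% loop7: handling fully inlined; the state is still a two-element list,
%% allocated anew on every iteration.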

loop7([Timestamp, Times]) ->
    receive
        {start, N} ->
            do_test(N),
            {A,B,C} = os:timestamp(),
            NTimestamp = (A * 1000000 + B) * 1000000 + C,
            loop7([NTimestamp, 0]);
        stop ->
            {A,B,C} = os:timestamp(),
            NTimestamp = (A * 1000000 + B) * 1000000 + C,
            Cost = NTimestamp - Timestamp,
            io:format("======Times:~p Cost:~p~n", [Times, Cost]);
        _Info ->
            loop7([Timestamp, Times + 1])
    end.

%%==============================================================================
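%% loop8: like loop7, but the state travels as two plain arguments,
%% so nothing is allocated per iteration.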

loop8(Timestamp, Times) ->
    receive
        {start, N} ->
            do_test(N),
            {A,B,C} = os:timestamp(),
            NTimestamp = (A * 1000000 + B) * 1000000 + C,
            loop8(NTimestamp, 0);
        stop ->
            {A,B,C} = os:timestamp(),
            NTimestamp = (A * 1000000 + B) * 1000000 + C,
            Cost = NTimestamp - Timestamp,
            io:format("======Times:~p Cost:~p~n", [Times, Cost]);
        _Info ->
            loop8(Timestamp, Times + 1)
    end.

%%==============================================================================
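%% loop9: like loop7, but with a tuple instead of a list as the state
%% container (see the note below the results).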

loop9({Timestamp, Times}) ->
    receive
        {start, N} ->
            do_test(N),
            {A,B,C} = os:timestamp(),
            NTimestamp = (A * 1000000 + B) * 1000000 + C,
            loop9({NTimestamp, 0});
        stop ->
            {A,B,C} = os:timestamp(),
            NTimestamp = (A * 1000000 + B) * 1000000 + C,
            Cost = NTimestamp - Timestamp,
            io:format("======Times:~p Cost:~p~n", [Times, Cost]);
        _Info ->
            loop9({Timestamp, Times + 1})
    end.

Result:

28> c(test_for_gen_server).          
{ok,test_for_gen_server}
29> test_for_gen_server:test1(50000).
{start,50000}
======gen_server:  Times:50000 Cost:2285054

30> test_for_gen_server:test2(50000).
{start,50000}
======gen_server:  Times:50000 Cost:2170294

31> test_for_gen_server:test3(50000).
{start,50000}
======gen_server:  Times:50000 Cost:1520796

32> test_for_gen_server:test4(50000).
{start,50000}
======gen_server:  Times:50000 Cost:1526084

33> test_for_gen_server:test5(50000).
{start,50000}
======gen_server:  Times:50000 Cost:1510738

34> test_for_gen_server:test6(50000).
{start,50000}
======gen_server:  Times:50000 Cost:1496024

35> test_for_gen_server:test7(50000).
{start,50000}
======Times:50000 Cost:863876

36> test_for_gen_server:test8(50000).
{start,50000}
======Times:50000 Cost:5830

47> test_for_gen_server:test9(50000).
{start,50000}
======Times:50000 Cost:640157

You can see how the execution time gets smaller with every change. Notice the difference between test2 and test3, where the only difference in the code is the additional function call. And especially pay attention to the dramatic difference between test7 and test8, where the only difference in the code is that test7 additionally creates and destroys a two-element list on each iteration of the loop.

The loop in test8 can be executed without allocating anything on the process heap, using only the VM's virtual registers, so it is the fastest. The other loops allocate some data on the heap on every iteration, which then has to be periodically freed by the garbage collector.

Note

test9 was added later to show that using tuples instead of lists when passing arguments between functions generally gives better performance.

Previous answer left for reference

This is because the receive clause needs to match incoming messages against the patterns that occur in it. It takes each message from the mailbox in turn and tries to match it; the first message that matches is processed.

So, if the queue builds up because messages don't match, it takes longer and longer to process each new incoming message (because the matching always starts from the first message in the queue).
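As a minimal illustration of that scanning cost (this demo module is not part of the original post), the following fills a fresh process's mailbox with N non-matching messages and then times a single selective receive, which has to scan past all of them to find the one matching message:

-module(selective_receive_demo).
-export([run/1]).

run(N) ->
    spawn(fun() ->
        Self = self(),
        %% Fill our own mailbox with N messages that will not match below.
        [Self ! {junk, I} || I <- lists:seq(1, N)],
        Self ! wanted,
        %% The receive must scan past all N junk messages to reach 'wanted'.
        {Micros, ok} = timer:tc(fun() -> receive wanted -> ok end end),
        io:format("Scanned past ~p messages in ~p us~n", [N, Micros])
    end).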

Therefore it's good practice to always flush unknown messages in gen_servers, as suggested by Joe Armstrong in his PhD thesis (section 5.8).
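For illustration, a minimal catch-all clause of this kind (the log message is just an example):

handle_info(Unexpected, State) ->
    %% Flush unknown messages so they cannot accumulate in the mailbox.
    error_logger:warning_msg("Unexpected message: ~p~n", [Unexpected]),
    {noreply, State}.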

This article explains it in more detail: Erlang explained: Selective receive. It's also explained in section 3.4 of Joe Armstrong's thesis mentioned earlier.

Greg
  • But all the messages match the third clause of handle_info: `handle_info(_Info, [Timestamp, Times])`. Your answer was also my first guess, but I don't think it is 100% relevant in this case. – Pascal Mar 25 '16 at 14:36
  • Hi Amiramix, nice to meet you. I was wondering: did you know about this issue before, or did you run the tests after I asked the question? How do you find the reason when you face a problem like this? – HooHoo Diddy Mar 28 '16 at 02:14
  • I jumped to the conclusion that it was caused by selective receive on the mailbox, but something didn't sound right. You pointed that out, so I inspected the `gen_server` code, and indeed there was no selective receive. As the answer was already upvoted I didn't want to delete it. I wrote loop8, which was very fast, which got me curious, because the `gen_server` wasn't much different. I then thought of calling `handle_info` from that loop, and the performance dropped. So I modified the code step by step to look for the cause. With problems like this I usually start by looking into the Erlang source code. – Greg Mar 28 '16 at 03:26
  • Thanks, Amiramix. But I still have some questions. Why does a list cause more GC work than a tuple? I also did some tests and found that a map's performance is worse than a record's. The code is [here](https://github.com/wudixiaotie/simple_im/blob/master/test/test_for_gc.erl). Why does R17 suggest we use maps instead of records? – HooHoo Diddy Apr 07 '16 at 08:14
  • GC has to do more work the more wasteful the data structure is. The wastefulness in this case is the ratio between the memory allocated to the data held in the data structure and the memory allocated to the data structure itself. A tuple is the simplest - it's like an array in C - it only holds the size and the type of each element. A list is a linked list - each element, apart from the type, also holds a pointer to the next element in the list. A map is the most wasteful because, being a hash table by nature, it needs to pre-allocate a set amount of slots: https://en.wikipedia.org/wiki/Hash_table – Greg Apr 07 '16 at 08:55
  • ... continuing: in most cases it doesn't matter, because processes usually have enough pre-allocated memory to complete the task, and when the process dies the memory is freed instantaneously without the GC running. For long-running processes, like a `gen_server`, I think the recommended way is still to use records rather than maps. Maps should be used where they make sense, e.g. when the keys are not known in advance. It's better to use maps and sacrifice performance in exchange for simpler and more maintainable code than to use super-performant records in code that's difficult to read and full of bugs. – Greg Apr 07 '16 at 09:02
  • ... continuing, here is my own take at measuring the time of reading and writing 30,000 elements randomly to various Erlang data structures and how much memory the resulting data structure consumes. Maps take up a lot of memory but they are pretty quick (when compared to other key-value based data stores): https://gist.github.com/amiramix/d43c9a73a6fe6d651d7f – Greg Apr 07 '16 at 09:06
  • Thanks, and I apologize for replying so late; your reply was folded under "show 3 more comments", so I didn't notice it. Haha, this is my first question, and I didn't know there was a feature like that. – HooHoo Diddy Apr 25 '16 at 07:20

Recently I ran into the same problem in Elixir, and I finally found the answer in this paper.

In section 3: "The memory architecture of Erlang is process-centric. Each process allocates and manages its own memory area, which typically includes a PCB, private stack, and private heap."

This causes a disadvantage: high memory fragmentation.

A process cannot utilize the memory (e.g., the heap) of another process even if there are large amounts of unused space in that memory area. This typically implies that processes can allocate only a small amount of memory by default. This in turn usually results in a larger number of calls to the garbage collector.

So a large mailbox can slow down the whole system.

蛋汤饭

Adding to 蛋汤饭's highly useful reply:

As of Erlang/OTP 20 one can use the process flag message_queue_data = off_heap to avoid GC slowdowns on large mailboxes. See the documentation for process_flag(Flag :: message_queue_data, MQD) here, and an example in the Erlang standard library:

init(_) ->
    %% The error logger process may receive a huge amount of
    %% messages. Make sure that they are stored off heap to
    %% avoid excessive GCs.
    process_flag(message_queue_data, off_heap),
    {ok, []}.

Alternatively, you can pass the {message_queue_data, off_heap} option when creating a process with erlang:spawn_opt/2.

However, in my experience the off_heap option does not help much. What helps is a custom flow-control scheme: let the sender process check the queue length of the receiver process with process_info(ReceiverPid, message_queue_len) and throttle if the queue length exceeds 1000. Check periodically, not on every send, to avoid killing performance; a sketch follows below. See also here.
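For illustration, a minimal sketch of such a check (the function name, the 1000-message threshold, and the sleep time are my own choices, to be tuned per workload):

%% Call this every few hundred sends, not on every send.
maybe_throttle(ReceiverPid) ->
    case process_info(ReceiverPid, message_queue_len) of
        {message_queue_len, Len} when Len > 1000 ->
            %% Back off and let the receiver drain its queue.
            timer:sleep(100);
        _ ->
            %% Queue is short, or the receiver has already exited.
            ok
    end.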

sizif