unicode collation NIF running slower than Pure Erlang implementation

Question

I'm trying to optimise existing unicode collation library(written in Erlang) by rewriting it as a NIF implementation. Prime reason is because collation is CPU intensive operation.

Link to implementation: https://github.com/abhi-bit/merger

Unicode collation of 1M rows via Pure Erlang based priority queue:

erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_couch_skew main 1000000 -s init stop
Queue size: 1000000
12321.649 ms

Unicode collation of 1M rows via NIF based binomial heap:

erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_merger main 1000000 -s init stop
Queue size: 1000000
15871.965 ms

This is unusual, I was expecting it to be probably ~10X faster.

I turned on eprof/fprof but they aren't of much use when it comes to NIF modules, below is what eprof said about the prominent functions

FUNCTION                                         CALLS      %     TIME  [uS / CALLS]
--------                                         -----    ---     ----  [----------]
merger:new/0                                         1   0.00        0  [      0.00]
merger:new/2                                         1   0.00        0  [      0.00]
merger:size/1                                   100002   0.31    19928  [      0.20]
merger:in/3                                     100000   3.29   210620  [      2.11]
erlang:put/2                                   2000000   6.63   424292  [      0.21]
merger:out/1                                    100000  14.35   918834  [      9.19]

I'm sure, NIF implementation could be made faster because I've a pure C implementation of unicode collation based on binary Heap using dynamic array and that's much much faster.

$ make
gcc -I/usr/local/Cellar/icu4c/55.1/include  -L/usr/local/Cellar/icu4c/55.1/lib  min_heap.c collate_json.c kway_merge.c kway_merge_test.c -o output -licui18n -licuuc -licudata
./output
Merging 1 arrays each of size 1000000
mergeKArrays took 84.626ms

Specific questions I've here:

How much slowdown is expected because of Erlang <-> C communication in a NIF module? In this case, slowdown is probably 30x or more between pure C and NIF implementation
What tools could be useful to debug NIF related slowdown(like in this case)? I tried using perf top to see the function call, top ones(some hex addresses were showing) were coming from "beam.smp".
What are possible areas that I should look at optimising a NIF? For example: I've heard that one should keep data being transferred between Erlang to C and vice-versa minimal, are there more such areas to consider?

score 3 · Accepted Answer · answered Jan 02 '16 at 03:46

3

The overhead of calling a NIF is tiny. When the Erlang runtime loads a module that loads a NIF, it patches the module's beam code with an emulator instruction to call into the NIF. The instruction itself performs just a small amount of setup prior to calling the C function implementing the NIF. This is not the area that's causing your performance issues.

Profiling a NIF is much the same as profiling any other C/C++ code. Judging from your Makefile it appears you're developing this code on OS X. On that platform, assuming you have XCode installed, you can use the Instruments application with the CPU Samples instrument to see where your code is spending most of its time. On Linux, you can use the callgrind tool of valgrind together with an Erlang emulator built with valgrind support to measure your code.

What you'll find if you use these tools on your code is, for example, that perf_merger:main/1 spends most of its time in merger_nif_heap_get, which in turn spends a noticeable amount of time in CollateJSON. That function seems to call convertUTF8toUChar and createStringFromJSON quite a bit. Your NIF also seems to perform a lot of memory allocation. These are the areas you should focus on to speed up your code.

answered Jan 02 '16 at 03:46

Steve Vinoski

19,847
3
31
46

Thanks Steve for your suggestions. Trying to get useful information out of Erlang VM with valgrind support. Interestingly on a Linux machine, NIF version appears faster than Erlang. – Abhi Jan 03 '16 at 03:32
Couple of findings from NIF perf runs: (a) Amount of data being transferred between NIF and Erlang(vice-versa) matter a lot(for larger data transfer i.e. 100KB or more per NIF call, performance can be horrible). (b) Context switch between Erlang <-> NIF isn't cheap, 1M calls with 40 bytes data(40MB aggregate) would take longer to sort than 10K 40KB data. – Abhi Jan 04 '16 at 18:14
There's no real "transfer" of data between a NIF and Erlang, in the sense of copying or such. I suspect what you're seeing is the fact that your `merger_nif_heap_put` and `merger_nif_heap_get` functions do a lot of heap allocations and copying, as I originally pointed out. You should investigate alternative ways of storing and retrieving data that don't involve so much allocation and copying. For example, `merger_item_create` takes two `ErlNifBinary` instances, allocates new memory, and copies them there; instead, why not just store them as `ErlNifBinary` instances? – Steve Vinoski Jan 04 '16 at 19:03
Steve, after cleanup w.r.t. mem allocations that could be avoided here is the CPU profile http://imgur.com/wV3oacz of the NIF. From the profile dump(simulating worst case scenario - passing 4KB key & 1MB value), I see majority of time is spent in copying variables from Erlang world to NIF and env_free calls. Any suggestions on what else can be optimised to reduce burden of copying data from Erlang to NIF & also freeing up memory post collation?(updated code sits in github) – Abhi Jan 18 '16 at 13:11
I'll need to clone your repo and take a closer look. I still think you're allocating and copying too much in your put function, but I'll need to take a closer look to know for sure. Any particular command line or shell command you'd recommend for being able to take measurements? – Steve Vinoski Jan 18 '16 at 16:06
Steve, much appreciate your useful responses on this thread so far. Updated the git repo with readme. I usually run couple of `perf_*` programs under `erl_prog` along with Instruments utility on mac. Sorry for polluting this SO thread, I hangout on #erlang as sin8h. – Abhi Jan 18 '16 at 16:51

unicode collation NIF running slower than Pure Erlang implementation

1 Answers1