
My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.

I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual-core VMs, but performs poorly on a single 64-core machine. I presume this is because the common data object resides in one NUMA cell, and many threads on other cells accessing it concurrently saturates the interconnects.

If I run 64 separate JVM applications, each containing 1 actor, then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
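On Linux, that per-cell launch could be sketched with numactl, binding each JVM's CPUs and memory to the same node (a dry run that just echoes the commands; drop the `echo` to actually launch, and `worker.jar` is a placeholder name):

```shell
# Sketch: one JVM per NUMA cell, CPU and memory both bound to that node.
# Dry run: echoes the commands instead of executing them.
for node in $(seq 0 7); do
  echo numactl --cpunodebind="$node" --membind="$node" java -jar worker.jar
done
```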

But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?

Update:

I'm using Oracle JDK 1.7.0_05 and Akka 2.1.4.

I've now tried the -XX:+UseNUMA and -XX:+UseParallelGC JVM options. Neither seemed to have any significant impact on the slow performance when using one or a few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor, with no effect. I'm not sure the configuration is even being picked up, though, since nothing looks different in the startup logs.
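For reference, the pinned-dispatcher configuration I tried looked roughly like this (a sketch using standard Akka 2.1 dispatcher settings; the dispatcher name `pinned-dispatcher` is my own):

```hocon
pinned-dispatcher {
  type = PinnedDispatcher
  executor = "thread-pool-executor"
}
```

with actors created via `system.actorOf(Props[Worker].withDispatcher("pinned-dispatcher"))`.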

The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of minutes) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase ulimit -u, since I was hitting the default maximum number of processes (1024).

Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.

Pengin
    Are you using the -XX:+UseNUMA jvm option? – cmbaxter May 28 '13 at 22:40
  • Also, what GC settings do you use? And what executor config? – Viktor Klang May 29 '13 at 00:37
  • You probably need to tell akka to use better threading patterns, see here for some mail box configuration options: http://doc.akka.io/docs/akka/snapshot/scala/dispatchers.html – Noah May 29 '13 at 03:25
  • This kind of thing is very dependent on the VM you are using, I assume you are using Oracle's HotSpot? – rxg May 29 '13 at 07:43
  • I would add the 'akka' tag. Possible guys from Typesafe will share experience how to tune Akka for servers with such number of CPUs: http://letitcrash.com/post/20397701710/50-million-messages-per-second-on-a-single-machine – Andriy Plokhotnyuk May 29 '13 at 11:04
  • Hotspot has the UseNUMA flag, and that's pretty much it about NUMA support. This partitions the eden, and interleave the old gen on each node using a round robin algo.(http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html). So the JVM cannot be smart about old objects. I believe your 8 JVM approach sounds nice, why don't you bind each of those to a node? (using numactl on Linux). – Alexandre de Champeaux Jun 10 '13 at 17:14

1 Answer


If you're sure the problem is not in your message-processing algorithms, then I think you should look at the whole environment configuration, not just the NUMA option: start with the JVM version (the latest is better, and Oracle JDK also mostly performs better than OpenJDK), then the JVM options (including GC, memory, and concurrency settings), then the Scala and Akka versions (the latest release candidates and milestones can be much better), and finally the Akka configuration.

From here you can borrow all the settings that matter to get 50M messages per second of total throughput for Akka actors on a contemporary laptop.

I've never had the chance to run these benchmarks on a 64-core server, so any feedback would be greatly appreciated.

One finding that may help: current implementations of ForkJoinPool increase message-send latency as the number of threads in the pool grows. This is most noticeable when the rate of request-response calls between actors is high. For example, on my laptop, increasing the pool size from 4 to 64 increases the message-send latency of Akka actors by 2-3x for most executor services (Scala's ForkJoinPool, JDK's ForkJoinPool, ThreadPoolExecutor).
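The effect can be sketched with a crude JDK-only stand-in (this is not the actual Akka benchmark; it only measures the cost of handing a no-op task to a pool at different pool sizes, which is one component of message-send latency):

```scala
import java.util.concurrent.{CountDownLatch, ExecutorService, Executors, ForkJoinPool}

// Crude stand-in for message-send latency: average nanoseconds to submit a
// no-op task and have the pool observe it, over many tasks.
def handOffLatencyNanos(pool: ExecutorService, tasks: Int): Long = {
  val latch = new CountDownLatch(tasks)
  val start = System.nanoTime()
  var i = 0
  while (i < tasks) {
    pool.execute(new Runnable { def run(): Unit = latch.countDown() })
    i += 1
  }
  latch.await()
  (System.nanoTime() - start) / tasks
}

// Compare ForkJoinPool vs. a fixed ThreadPoolExecutor at growing sizes.
for (size <- Seq(4, 16, 64)) {
  val fjp = new ForkJoinPool(size)
  val tpe = Executors.newFixedThreadPool(size)
  println(s"size=$size  ForkJoinPool=${handOffLatencyNanos(fjp, 100000)}ns  ThreadPoolExecutor=${handOffLatencyNanos(tpe, 100000)}ns")
  fjp.shutdown()
  tpe.shutdown()
}
```

Absolute numbers will vary by machine and say nothing about NUMA placement, but the trend with pool size should be visible.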

You can check whether there are any differences by running mvnAll.sh with the benchmark.parallelism system property set to different values.
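A sweep over pool sizes could look something like this (a sketch assuming mvnAll.sh picks up JVM flags from MAVEN_OPTS; shown as a dry run that echoes the commands, so drop the `echo` to actually execute):

```shell
# Sketch: sweep benchmark.parallelism across pool sizes.
# Dry run: echoes the commands instead of executing them.
for p in 4 8 16 32 64; do
  echo MAVEN_OPTS="-Dbenchmark.parallelism=$p" ./mvnAll.sh
done
```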

Andriy Plokhotnyuk
  • Here's a blog post describing scalability profile of akka on our 48 core test server using FJP: http://letitcrash.com/post/20397701710/50-million-messages-per-second-on-a-single-machine – Viktor Klang May 29 '13 at 14:40