I am trying to understand which layer of Akka clustering sends occasional messages to dead letters when volume increases or all receiving actors are busy doing work, and how to tune it to eliminate this behavior.
Here is the basic topology: 2 nodes. Node1 consists of a set of actors (let's call them publishing actors) and an Akka cluster-aware router. The publishing actors publish messages to the router (RoundRobin), which in turn routes them to Node2, which consists of worker actors (let's call them subscriber actors) that receive a message, do some work, and ack back to the publishing actors.
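For reference, the router deployment on Node1 looks roughly like this (a sketch; the actor paths and the role name are illustrative, not our exact config):

    akka.actor.deployment {
      /publisher/workerRouter {
        router = round-robin-group
        # paths of the subscriber actors on Node2
        routees.paths = ["/user/subscriberWorker"]
        cluster {
          enabled = on
          allow-local-routees = off
          use-role = worker
        }
      }
    }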
Observations: under a high rate of published messages (well, not that high for Akka: 10K in 10 seconds), with the subscriber workers busy, I see occasional dead letters on both sides (publishing actors and subscriber actors acking back). The dead letter rate was almost 30-40%, but after profiling, noticing thread starvation, and configuring a separate dispatcher for the cluster and a PinnedDispatcher for the subscriber workers, we were able to reduce it to 1-2%. Worth noting: the high dead letter rate was observed when the default dispatcher with its fork-join thread pool was used and the number of actors was higher than the number of threads; the rate was much lower when the number of actors was below the number of threads, leading us to the conclusion that the fork-join pool is also being used by other Akka system processing. RAM, GC, and CPU all look under control. The actors use the default unbounded mailbox, so it cannot be a mailbox overflow issue. As far as I know, Akka does not manage back pressure.
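One thing I am still checking: even with unbounded mailboxes, classic remoting keeps its own buffers on the Netty transport, separate from actor mailboxes. These are the knobs I have found for them (the values shown are, I believe, the defaults):

    akka.remote.netty.tcp {
      send-buffer-size = 256000b
      receive-buffer-size = 256000b
      maximum-frame-size = 128000b
    }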
Of course we understand that Akka does not guarantee delivery and that we have to implement our own retry logic. The main goal here is to understand what is causing the dead letters: is it occurring in Akka remoting, in the Netty transport layer..., and are there timeouts that can be tuned and configured?
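For context, the kind of retry logic we mean is along these lines (a minimal sketch; the Work/Ack types and the resend interval are made up, and akka-persistence's AtLeastOnceDelivery would be another option):

    import akka.actor.{Actor, ActorRef, Cancellable}
    import scala.concurrent.duration._

    case class Work(id: Long, payload: String)
    case class Ack(id: Long)

    class RetryingPublisher(router: ActorRef) extends Actor {
      import context.dispatcher // the scheduler needs an ExecutionContext

      // id -> timer that keeps resending until the matching Ack arrives
      private var pending = Map.empty[Long, Cancellable]

      def receive = {
        case w: Work =>
          router ! w
          pending += w.id -> context.system.scheduler.schedule(1.second, 1.second, router, w)
        case Ack(id) =>
          pending.get(id).foreach(_.cancel())
          pending -= id
      }
    }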
I have spent quite a good chunk of time profiling, adding extra logging, and capturing and logging dead letters, but did not get any clue as to the actual cause.
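For reference, we capture them by subscribing to the event stream, roughly like this; the logged sender/recipient is how we tell publisher-side drops from ack-side drops:

    import akka.actor.{Actor, ActorLogging, ActorSystem, DeadLetter, Props}

    // Logs every dead letter together with its sender and intended recipient.
    class DeadLetterListener extends Actor with ActorLogging {
      def receive = {
        case d: DeadLetter =>
          log.warning("dead letter: msg={} from={} to={}", d.message, d.sender, d.recipient)
      }
    }

    object DeadLetterProbe extends App {
      val system = ActorSystem("node1")
      val listener = system.actorOf(Props[DeadLetterListener], "deadLetterListener")
      system.eventStream.subscribe(listener, classOf[DeadLetter])
    }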
Any hints, things to try, or additional information would be greatly appreciated.
Here is the config we added:
cluster-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-max = 4
  }
}

# used by worker
worker-pinned-dispatcher {
  executor = "thread-pool-executor"
  type = PinnedDispatcher
}
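And this is how the dispatchers get wired in (the deployment path is illustrative; the worker dispatcher could equally be set in code via Props(...).withDispatcher("worker-pinned-dispatcher")):

    # run cluster internals on the dedicated dispatcher
    akka.cluster.use-dispatcher = cluster-dispatcher

    # pin each subscriber worker to its own thread
    akka.actor.deployment {
      /subscriberParent/* {
        dispatcher = worker-pinned-dispatcher
      }
    }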