I am trying to understand which layer of Akka clustering sends occasional messages to dead letters when volume increases or all receiving actors are busy doing work, and how to tune it to eliminate this behavior.
Here is the basic topology: 2 nodes. Node1 consists of a set of actors (let's call them publishing actors) and an Akka cluster-aware router. The publishing actors publish messages to the router (RoundRobin), which in turn routes them to Node2, which consists of worker actors (let's call them subscriber actors) that receive a message, do some work, and ack back to the publishing actors.
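For reference, the router deployment on Node1 looks roughly like this (a sketch; the actor paths and the role name are illustrative, not our exact config):

    akka.actor.deployment {
      /publisher/workerRouter {
        router = round-robin-group
        # paths of the subscriber actors on Node2
        routees.paths = ["/user/subscriberWorker"]
        cluster {
          enabled = on
          allow-local-routees = off
          use-role = worker
        }
      }
    }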
Observations: under a high rate of published messages (well, not that high for Akka: 10K in 10 seconds), with the subscriber workers busy, I see occasional dead letters on both sides (publishing actors and subscriber actors acking back). The dead letter rate was almost 30-40%, but after profiling, noticing thread starvation, and configuring a separate dispatcher for the cluster and a PinnedDispatcher for the subscriber workers, we were able to reduce it to 1-2%. Worth noting: the high dead letter rate was observed when the default dispatcher with its fork-join thread pool was used and the number of actors was higher than the number of threads; the rate was much lower when the number of actors was below the number of threads, leading us to the conclusion that the fork-join pool is also being used by other Akka system processing. RAM, GC, and CPU all look under control. The actors use the default unbounded mailbox, so it cannot be a mailbox overflow issue. As far as I know, Akka does not manage back pressure.
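One thing I am still checking: even with unbounded mailboxes, classic remoting keeps its own buffers on the Netty transport, separate from actor mailboxes. These are the knobs I have found for them (the values shown are, I believe, the defaults):

    akka.remote.netty.tcp {
      send-buffer-size = 256000b
      receive-buffer-size = 256000b
      maximum-frame-size = 128000b
    }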
Of course we understand that Akka does not guarantee delivery and that we have to implement our own retry logic. The main goal here is to understand what is causing the dead letters: is it occurring in Akka remoting, in the Netty transport layer..., and are there timeouts that can be tuned and configured?
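For context, the kind of retry logic we mean is along these lines (a minimal sketch; the Work/Ack types and the resend interval are made up, and akka-persistence's AtLeastOnceDelivery would be another option):

    import akka.actor.{Actor, ActorRef, Cancellable}
    import scala.concurrent.duration._

    case class Work(id: Long, payload: String)
    case class Ack(id: Long)

    class RetryingPublisher(router: ActorRef) extends Actor {
      import context.dispatcher // the scheduler needs an ExecutionContext

      // id -> timer that keeps resending until the matching Ack arrives
      private var pending = Map.empty[Long, Cancellable]

      def receive = {
        case w: Work =>
          router ! w
          pending += w.id -> context.system.scheduler.schedule(1.second, 1.second, router, w)
        case Ack(id) =>
          pending.get(id).foreach(_.cancel())
          pending -= id
      }
    }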
I have spent quite a good chunk of time profiling, adding extra logging, and capturing and logging dead letters, but did not get any clue as to the actual cause.
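For reference, we capture them by subscribing to the event stream, roughly like this; the logged sender/recipient is how we tell publisher-side drops from ack-side drops:

    import akka.actor.{Actor, ActorLogging, ActorSystem, DeadLetter, Props}

    // Logs every dead letter together with its sender and intended recipient.
    class DeadLetterListener extends Actor with ActorLogging {
      def receive = {
        case d: DeadLetter =>
          log.warning("dead letter: msg={} from={} to={}", d.message, d.sender, d.recipient)
      }
    }

    object DeadLetterProbe extends App {
      val system = ActorSystem("node1")
      val listener = system.actorOf(Props[DeadLetterListener], "deadLetterListener")
      system.eventStream.subscribe(listener, classOf[DeadLetter])
    }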
Any hints, things to try, or additional information would be greatly appreciated.
Here is the config we added:
cluster-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-max = 4
  }
}

# used by worker
worker-pinned-dispatcher {
  executor = "thread-pool-executor"
  type = PinnedDispatcher
}
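And this is how the dispatchers get wired in (the deployment path is illustrative; the worker dispatcher could equally be set in code via Props(...).withDispatcher("worker-pinned-dispatcher")):

    # run cluster internals on the dedicated dispatcher
    akka.cluster.use-dispatcher = cluster-dispatcher

    # pin each subscriber worker to its own thread
    akka.actor.deployment {
      /subscriberParent/* {
        dispatcher = worker-pinned-dispatcher
      }
    }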