Akka.net / Cluster - How to "Heal" the topology when the leader dies?

Question

I set up a basic test topology with Petabridge Lighthouse and two simple test actors that communicate with each other. This works well so far, but there is one problem: Lighthouse (or the underlying Akka.Cluster) makes one of my actors the leader, and when not shutting the node down gracefully (e.g. when something crashes badly or I simply hit "Stop" in VS) the Lighthouse is not usable any more. Tons of exceptions scroll by and it must be restarted.

Is it possible to configure Akka.Cluster .net in a way that the rest of the topology elects a new leader and carries on?

score 2 · Answer 1 · answered Dec 01 '17 at 14:24

There are 2 things to point here. One is that if you have a serious risk of your lighthouse node going down, you probably should have more that one - akka.cluster.seed-nodes setting can take multiple addresses, the only requirement here is that all nodes, including lighthouses, must have them specified in the same order. This way if one lighthouse is going down, another one still can take its role.

Other thing is that when a node becomes unreachable (either because the process crashed on network connection is unavailable), by default akka.net cluster won't down that node. You need to tell it, how it should behave, when such thing happens:

At any point you can configure your own IDowningProvider interface, that will be triggered after certain period of node inactivity will be reached. Then you can manually decide what to do. To use it add fully qualified type name to followin setting: akka.cluster.downing-provider = "MyNamespace.MyDowningProvider, MyAssembly". Example downing provider implementation can be seen here.
You can specify akka.cluster.auto-down-unreachable-after = 10s (or other time value) to specify some timeout given for an unreachable node to join - if it won't join before the timeout triggers, it will be kicked out from the cluster. Only risk here is when cluster split brain happens: under certain situations a network failure between machines can split your cluster in two, if that happens with auto-down set up, two halves of the cluster may consider each other dead. In this case you could end up having two separate clusters instead of one.
Starting from the next release (Akka.Cluster 1.3.3) a new Split Brain Resolver feature will be available. It will allow you to configure more advanced strategies on how to behave in case of network partitions and machine crashes.

Akka.net / Cluster - How to "Heal" the topology when the leader dies?

1 Answers1