
I want to create a system that will not have a single point of failure. I was under the impression that routers are the tool for doing that, but I'm not sure it works as I would expect. This is the entry point of my program:

import akka.actor.{ActorSystem, Props}
import akka.cluster.routing.{ClusterRouterPool, ClusterRouterPoolSettings}
import akka.routing.RoundRobinPool
import com.typesafe.config.ConfigFactory

object Main extends App {
  val system = ActorSystem("mySys", ConfigFactory.load("application"))
  // Cluster-aware pool router: deploys at most 2 TestActor routees, one per node,
  // on remote nodes that have the "testActor" role.
  val router = system.actorOf(
    ClusterRouterPool(RoundRobinPool(0), ClusterRouterPoolSettings(
      totalInstances = 2, maxInstancesPerNode = 1,
      allowLocalRoutees = false, useRole = Some("testActor"))).props(Props[TestActor]),
    name = "testActors")
}

And this is the code for running the remote ActorSystem (so the router can deploy the TestActor code to the remote nodes):

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object TestActor extends App {
  val system = ActorSystem("mySys", ConfigFactory.load("application").getConfig("testactor1"))
  case object PrintRouterPath
}

I'm running this twice, once with testactor1 and once with testactor2.

TestActor code:

import scala.concurrent.duration._

import akka.actor.{Actor, ActorLogging}
import TestActor.PrintRouterPath

class TestActor extends Actor with ActorLogging {
  implicit val executionContext = context.dispatcher
  // Every 30 seconds (after an initial 10-second delay), remind ourselves to log our parent's path.
  context.system.scheduler.schedule(10000 milliseconds, 30000 milliseconds, self, PrintRouterPath)

  override def receive: Receive = {
    case PrintRouterPath =>
      log.info(s"router is on path ${context.parent}")
  }
}

And application.conf

akka {
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "127.0.0.1"
      port = 2552
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://mySys@127.0.0.1:2552",
      "akka.tcp://mySys@127.0.0.1:2553",
      "akka.tcp://mySys@127.0.0.1:2554"]
    auto-down-unreachable-after = 20s
  }
}
testactor1 {
  akka {
    actor {
      provider = "akka.cluster.ClusterActorRefProvider"
    }
    remote {
      log-remote-lifecycle-events = off
      netty.tcp {
        hostname = "127.0.0.1"
        port = 2554
      }
    }
    cluster {
      roles.1 = "testActor"
      seed-nodes = [
        "akka.tcp://mySys@127.0.0.1:2552",
        "akka.tcp://mySys@127.0.0.1:2553",
        "akka.tcp://mySys@127.0.0.1:2554"]
      auto-down-unreachable-after = 20s
    }
  }
}
testactor2 {
  akka {
    actor {
      provider = "akka.cluster.ClusterActorRefProvider"
    }
    remote {
      log-remote-lifecycle-events = off
      netty.tcp {
        hostname = "127.0.0.1"
        port = 2553
      }
    }
    cluster {
      roles.1 = "testActor"
      seed-nodes = [
        "akka.tcp://mySys@127.0.0.1:2552",
        "akka.tcp://mySys@127.0.0.1:2553",
        "akka.tcp://mySys@127.0.0.1:2554"]
      auto-down-unreachable-after = 20s
    }
  }
}

Now the problem is that when the process that started the router is killed, the actors running the TestActor code no longer receive any messages (the messages the scheduler sends). I would have expected the router to be deployed on another seed node in the cluster and the actors to be recovered. Is this possible? Or is there another way of implementing this flow without having a single point of failure?

user_s

2 Answers


I think that, by deploying the router on only one node, you are setting up a master-slave cluster, where the master is a single point of failure by definition.

From what I understand (looking at the docs), a router can be cluster-aware in the sense that it can deploy (pool mode) or look up (group mode) routees on nodes in the cluster. The router itself will not react to failure by re-spawning somewhere else in the cluster.

I believe you have 2 options:

  1. make use of multiple routers to make your system more fault-tolerant. Routees can either be shared (group mode) or not (pool mode) between routers.

  2. make use of the Cluster Singleton pattern, which allows for a master-slave configuration where the master will automatically be re-spawned in case of failure. In relation to your example, note that this behaviour is achieved by having an actor (ClusterSingletonManager) deployed on each node. This actor has the purpose of working out whether the chosen master needs to be respawned and where. None of this logic is in place in the case of a cluster-aware router like the one you set up (see the sketch below).

You can find examples of multiple cluster setups in this Activator sample.
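
For illustration, here is a minimal sketch of option 2. This is my addition, not code from the question or the answer: it assumes Akka 2.4 with the akka-cluster-tools module on the classpath, reuses the TestActor class and the "testActor" role from the question, and SingletonMain is a hypothetical name.

import akka.actor.{ActorSystem, PoisonPill, Props}
import akka.cluster.singleton.{ClusterSingletonManager, ClusterSingletonManagerSettings, ClusterSingletonProxy, ClusterSingletonProxySettings}
import com.typesafe.config.ConfigFactory

object SingletonMain extends App {
  val system = ActorSystem("mySys", ConfigFactory.load("application"))

  // Run this on every node that has the "testActor" role: the managers elect one node
  // to host the single TestActor instance and re-create it elsewhere if that node dies.
  system.actorOf(
    ClusterSingletonManager.props(
      singletonProps = Props[TestActor],
      terminationMessage = PoisonPill,
      settings = ClusterSingletonManagerSettings(system).withRole("testActor")),
    name = "testActorSingleton")

  // A proxy can be started on any node; it always routes to the currently active instance
  // and buffers messages while the singleton is being moved.
  val proxy = system.actorOf(
    ClusterSingletonProxy.props(
      singletonManagerPath = "/user/testActorSingleton",
      settings = ClusterSingletonProxySettings(system).withRole("testActor")),
    name = "testActorProxy")
}

Note that a singleton means only one TestActor instance is active in the cluster at any time.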

Stefano Bonetti
  • 1) Let's say I have two nodes running TestActor; you then suggest starting the router on each one of them (group mode, so each router sees exactly the same two instances). Now how will I use the router? I mean, what is the purpose of using it? If I want to send a broadcast message to the routees, I will either send a message to one of the nodes containing the router (and that node can be unavailable) or send to all of them and then get multiple message handling. Am I missing something? 2) If I use `ClusterSingletonManager`, wouldn't that mean that I can't start two actors with `TestActor`? – user_s Jan 29 '17 at 08:31

I tested two approaches. First, using your code with ClusterRouterPool: like you said, when the process that started the router is killed, TestActor no longer receives messages. While reading the documentation and testing, I found that if you change this in application.conf:

`auto-down-unreachable-after = 20s` 

to this

`auto-down-unreachable-after = off`

the TestActor keeps receiving the messages, although the following warning appears in the log:

[WARN] [01/30/2017 17:20:26.017] [mySys-akka.remote.default-remote-dispatcher-5] [akka.tcp://mySys@127.0.0.1:2554/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FmySys%40127.0.0.1%3A2552-0] Association with remote system [akka.tcp://mySys@127.0.0.1:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://mySys@127.0.0.1:2552]] Caused by: [Connection refused: /127.0.0.1:2552]
[INFO] [01/30/2017 17:20:29.860] [mySys-akka.actor.default-dispatcher-4] [akka.tcp://mySys@127.0.0.1:2554/remote/akka.tcp/mySys@127.0.0.1:2552/user/testActors/c1] router is on path Actor[akka.tcp://mySys@127.0.0.1:2552/user/testActors#-1120251475]
[WARN] [01/30/2017 17:20:32.016] [mySys-akka.remote.default-remote-dispatcher-5]

And in the case where the MainApp is restarted, the log works normally, without warnings or errors.

MainApp Log:

[INFO] [01/30/2017 17:23:32.756] [mySys-akka.actor.default-dispatcher-2] [akka.cluster.Cluster(akka://mySys)] Cluster Node [akka.tcp://mySys@127.0.0.1:2552] - Welcome from [akka.tcp://mySys@127.0.0.1:2554]

TestActor Log:

[INFO] [01/30/2017 17:23:21.958] [mySys-akka.actor.default-dispatcher-14] [akka.cluster.Cluster(akka://mySys)] Cluster Node [akka.tcp://mySys@127.0.0.1:2554] - New incarnation of existing member [Member(address = akka.tcp://mySys@127.0.0.1:2552, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
[INFO] [01/30/2017 17:23:21.959] [mySys-akka.actor.default-dispatcher-14] [akka.cluster.Cluster(akka://mySys)] Cluster Node [akka.tcp://mySys@127.0.0.1:2554] - Marking unreachable node [akka.tcp://mySys@127.0.0.1:2552] as [Down]
[INFO] [01/30/2017 17:23:22.454] [mySys-akka.actor.default-dispatcher-2] [akka.cluster.Cluster(akka://mySys)] Cluster Node [akka.tcp://mySys@127.0.0.1:2554] - Leader can perform its duties again
[INFO] [01/30/2017 17:23:22.461] [mySys-akka.actor.default-dispatcher-2] [akka.cluster.Cluster(akka://mySys)] Cluster Node [akka.tcp://mySys@127.0.0.1:2554] - Leader is removing unreachable node [akka.tcp://mySys@127.0.0.1:2552]
[INFO] [01/30/2017 17:23:32.728] [mySys-akka.actor.default-dispatcher-4] [akka.cluster.Cluster(akka://mySys)] Cluster Node [akka.tcp://mySys@127.0.0.1:2554] - Node [akka.tcp://mySys@127.0.0.1:2552] is JOINING, roles []
[INFO] [01/30/2017 17:23:33.457] [mySys-akka.actor.default-dispatcher-14] [akka.cluster.Cluster(akka://mySys)] Cluster Node [akka.tcp://mySys@127.0.0.1:2554] - Leader is moving node [akka.tcp://mySys@127.0.0.1:2552] to [Up]
[INFO] [01/30/2017 17:23:37.925] [mySys-akka.actor.default-dispatcher-19] [akka.tcp://mySys@127.0.0.1:2554/remote/akka.tcp/mySys@127.0.0.1:2552/user/testActors/c1] router is on path Actor[akka.tcp://mySys@127.0.0.1:2552/user/testActors#-630150507]

The other approach is to use ClusterRouterGroup, because the routees are shared among the nodes of the cluster. From the Akka documentation:

  • Group - router that sends messages to the specified path using actor selection. The routees can be shared among routers running on different nodes in the cluster. One example of a use case for this type of router is a service running on some backend nodes in the cluster and used by routers running on front-end nodes in the cluster.
  • Pool - router that creates routees as child actors and deploys them on remote nodes. Each router will have its own routee instances. For example, if you start a router on 3 nodes in a 10-node cluster, you will have 30 routees in total if the router is configured to use one instance per node. The routees created by the different routers will not be shared among the routers. One example of a use case for this type of router is a single master that coordinates jobs and delegates the actual work to routees running on other nodes in the cluster.

The Main App

import akka.actor.ActorSystem
import akka.cluster.routing.{ClusterRouterGroup, ClusterRouterGroupSettings}
import akka.routing.RoundRobinGroup
import com.typesafe.config.ConfigFactory

object Main extends App {

  val system = ActorSystem("mySys", ConfigFactory.load("application.conf"))
  // Cluster-aware group router: looks up routees at /user/testActor on remote
  // nodes with the "testActor" role instead of deploying them itself.
  val routerGroup = system.actorOf(
    ClusterRouterGroup(RoundRobinGroup(Nil), ClusterRouterGroupSettings(
      totalInstances = 2, routeesPaths = List("/user/testActor"),
      allowLocalRoutees = false, useRole = Some("testActor"))).props(),
    name = "testActors")
}
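
For completeness, here is a hypothetical sketch (my addition, not part of the original answer) of how the group router could actually be exercised from the Main node: it reuses the same ClusterRouterGroup settings and routes the question's TestActor.PrintRouterPath message through the router; MainWithDriver is a made-up name.

import scala.concurrent.duration._

import akka.actor.ActorSystem
import akka.cluster.routing.{ClusterRouterGroup, ClusterRouterGroupSettings}
import akka.routing.RoundRobinGroup
import com.typesafe.config.ConfigFactory

object MainWithDriver extends App {
  val system = ActorSystem("mySys", ConfigFactory.load("application"))

  val routerGroup = system.actorOf(
    ClusterRouterGroup(RoundRobinGroup(Nil), ClusterRouterGroupSettings(
      totalInstances = 2, routeesPaths = List("/user/testActor"),
      allowLocalRoutees = false, useRole = Some("testActor"))).props(),
    name = "testActors")

  // Send a message through the router every 30 seconds; the round-robin group
  // delivers each tick to one of the remote testActor routees.
  import system.dispatcher // implicit ExecutionContext for the scheduler
  system.scheduler.schedule(10.seconds, 30.seconds, routerGroup, TestActor.PrintRouterPath)
}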

You must start the TestActor on each remote node:

import akka.actor.{ActorSystem, Props}
import com.typesafe.config.ConfigFactory

object TestActor extends App {
  val system = ActorSystem("mySys", ConfigFactory.load("application").getConfig("testactor1"))
  system.actorOf(Props[TestActor], "testActor")
  case object PrintRouterPath
}

http://doc.akka.io/docs/akka/2.4/scala/cluster-usage.html#Router_with_Group_of_Routees

The routee actors should be started as early as possible when starting the actor system, because the router will try to use them as soon as the member status is changed to 'Up'.
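
As a side note (my addition, based on the Akka 2.4 routing documentation linked above), the same cluster-aware group router can also be declared in the deployment configuration and created with FromConfig. A minimal sketch, with the deployment block inlined for brevity and ConfiguredRouterMain as a hypothetical name:

import akka.actor.ActorSystem
import akka.routing.FromConfig
import com.typesafe.config.ConfigFactory

object ConfiguredRouterMain extends App {
  // In a real project this deployment section would live in application.conf.
  val deployment = ConfigFactory.parseString("""
    akka.actor.deployment {
      /testActors {
        router = round-robin-group
        routees.paths = ["/user/testActor"]
        cluster {
          enabled = on
          allow-local-routees = off
          use-role = testActor
        }
      }
    }
  """)

  val system = ActorSystem("mySys", deployment.withFallback(ConfigFactory.load("application")))
  // FromConfig picks up the router definition for the /testActors path.
  val routerGroup = system.actorOf(FromConfig.props(), name = "testActors")
}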

I hope it helps you

gaston