
We are running an Akka cluster in Docker on Mesos. Three different applications (each with 4 instances) talk to each other within the cluster.

For deployments we use Marathon's upgrade strategy feature. It is configured to create a new node with the latest deployment, then kill one of the old nodes, and repeat this process until all new nodes are up. We use the configuration below to achieve this (for 4 nodes):

"upgradeStrategy": {
    "minimumHealthCapacity": 1,
    "maximumOverCapacity": 0.3
},
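To make the numbers concrete, here is a rough sketch of how these two fractions translate into instance counts for a 4-node app. The function name and the rounding rules (round up for the healthy minimum, round down for the total cap) are assumptions on my part, so check them against your Marathon version.

```python
import math


def rolling_upgrade_bounds(instances, minimum_health_capacity, maximum_over_capacity):
    """Return (min_healthy, max_total) instance counts during an upgrade.

    Assumption: the healthy minimum is rounded up and the total cap is
    rounded down; this is illustrative, not taken from the question.
    """
    min_healthy = math.ceil(instances * minimum_health_capacity)
    max_total = math.floor(instances * (1 + maximum_over_capacity))
    return min_healthy, max_total


# The configuration from the question, for 4 instances.
min_healthy, max_total = rolling_upgrade_bounds(4, 1, 0.3)
```

Under those assumptions all 4 instances must stay healthy and at most 5 may exist at once, which matches the "start one new node, then kill one old node" behaviour described above.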

Our main goal is to have minimal failures during deployment. However, it takes some time for nodes in the other applications to learn about the killed node, so some traffic is still directed to it and eventually fails. We tuned the cluster failure detector to reduce this time, but we still see a significant percentage of failures during the deployment window.

What can be done to handle this? Is there a way to trap the signal from Mesos and remove the node gracefully from the cluster?

slowhandblues

2 Answers


What I would probably do is use Akka Management with akka.management.http.route-providers-read-only set to false. This exposes the Akka Cluster Management HTTP endpoints which allow you to change cluster state via HTTP calls.

The HTTP endpoint of interest is DELETE /cluster/members/{address} where address is a Cluster URI like akka://Main@ip.add.re.ss:port. Depending on the particulars of your Marathon deployment, the IP address and port are available as environment variables to the docker entrypoint. Thus, you can modify your application launch script to, after the application exits:

  • query the Marathon API for other instances of the application
  • hit the endpoint above on the other instances (one should be sufficient thanks to the gossip protocol, but as your clusters get larger, the probability of hitting this endpoint on a node before the gossip arrives increases)
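The second step above could be sketched roughly as follows, in Python for brevity (the launch script could just as well use curl). The function names, the peer list, and the assumption that Akka Management listens on its default port 8558 are mine; the peer hosts would come from the Marathon API query in the first step, which is not shown here.

```python
import urllib.request

MANAGEMENT_PORT = 8558  # assumed: Akka Management's default HTTP port


def member_address(system, host, remoting_port):
    """Build the Cluster URI of the node that just exited."""
    return f"akka://{system}@{host}:{remoting_port}"


def build_delete_request(peer_host, member, management_port=MANAGEMENT_PORT):
    """Prepare DELETE /cluster/members/{address} against one surviving peer."""
    url = f"http://{peer_host}:{management_port}/cluster/members/{member}"
    return urllib.request.Request(url, method="DELETE")


def remove_member(peer_hosts, member, timeout=2.0):
    """Try each peer in turn; stop at the first one that accepts the call."""
    for peer in peer_hosts:
        try:
            request = build_delete_request(peer, member)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                if 200 <= response.status < 300:
                    return True
        except OSError:
            # Peer unreachable or returned an error status; try the next one.
            continue
    return False
```

A launch script would call something like `remove_member(peers, member_address("Main", own_ip, own_port))` after the application process has exited, with `own_ip` and `own_port` read from whatever environment variables your Marathon setup provides.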

There will still be a window while the application shuts down where the other nodes believe that it's still up, but this will likely be faster than waiting for the failure detector to judge a node down. And if you have an application (e.g. one which makes heavy use of persistent actors) where you'd like to minimize false-positive failures, you can loosen the failure detection thresholds while still having nodes removed quickly during deployments.

Levi Ramsey

If you can drop the dependency on Docker, you can use the default executor to handle graceful shutdown.

The Mesos executor sends SIGTERM and, after some time (the grace period), forcefully kills the application. You can handle SIGTERM in your app to gracefully unregister it and exit before it gets killed.
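As a language-neutral illustration of that pattern, here is a minimal Python sketch (a JVM app would do the equivalent with its own shutdown hooks). The class and the `serve_once` and `unregister` callbacks are hypothetical placeholders for your application's work loop and its cluster-leave logic.

```python
import signal
import time


class GracefulShutdown:
    """Records that SIGTERM arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread (POSIX systems).
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler trivial: just set a flag.
        self.requested = True


def run(serve_once, unregister, poll_seconds=0.1):
    """Serve until SIGTERM is seen, then unregister before returning."""
    shutdown = GracefulShutdown()
    shutdown.install()
    while not shutdown.requested:
        serve_once()
        time.sleep(poll_seconds)
    # Runs inside the grace period, before the executor force-kills us.
    unregister()
```

The unregister step has to finish within the grace period, so it should be kept short and bounded by its own timeout.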

Alternatively, you can write a custom executor.

janisz