We are running an Akka cluster in Docker on Mesos. Three different applications, each with 4 instances, talk to each other within the cluster.
For deployments we use Marathon's upgrade strategy feature. It is configured so that Marathon starts a new node with the latest deployment, then kills one of the old nodes, and repeats this until all nodes run the new version. We use the configuration below to achieve this (for 4 nodes):
    "upgradeStrategy": {
        "minimumHealthCapacity": 1,
        "maximumOverCapacity": 0.3
    },
Our main goal is to minimize failures during deployment. However, it takes some time for nodes in the other applications to learn that a node has been killed, and in the meantime some traffic is still directed to that node and fails. We tuned the cluster failure detector to reduce this time, but we still see a significant percentage of failures during the deployment window.
What can be done to handle this? Is there a way to trap the signal from Mesos and remove the node gracefully from the cluster?
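To make the last question concrete, here is a minimal sketch of the kind of hook we have in mind. It uses only the JDK and relies on the fact that the JVM runs its shutdown hooks when it receives SIGTERM. The names `GracefulLeave`, `buildHook`, `requestLeave` and `removed` are hypothetical, and the Akka calls are only mentioned in comments as an assumption about how the hook would be wired up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: not part of Akka, Mesos or Marathon.
final class GracefulLeave {

    /**
     * Builds a thread suitable for Runtime.addShutdownHook.
     * When run, it asks the node to leave the cluster and then waits
     * (up to timeoutMs) until the removal has been confirmed.
     */
    static Thread buildHook(Runnable requestLeave, CountDownLatch removed, long timeoutMs) {
        return new Thread(() -> {
            // Assumption: with Akka this would be
            //   cluster.leave(cluster.selfAddress())
            requestLeave.run();
            try {
                // Block JVM exit until the member is removed or the timeout expires.
                removed.await(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "graceful-leave-hook");
    }

    public static void main(String[] args) {
        CountDownLatch removed = new CountDownLatch(1);

        // Stand-in for the real cluster calls. Assumption: with Akka the latch
        // would be released from cluster.registerOnMemberRemoved(removed::countDown).
        Runnable requestLeave = () -> {
            System.out.println("leave requested");
            removed.countDown();
        };

        // The JVM runs this hook on SIGTERM and on normal exit.
        Runtime.getRuntime().addShutdownHook(buildHook(requestLeave, removed, 10_000));
    }
}
```

This only helps if the kill actually reaches the JVM inside the container as SIGTERM, and if the grace period before SIGKILL is long enough for the leave to propagate to the other applications. Both are assumptions here.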