
I have a Docker Swarm cluster with managers running on 6 AWS instances. During some testing, we accidentally terminated 3 of the instances that were running managers. Now the swarm no longer works, and commands fail with an error like:

Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online

I tried removing the terminated managers with docker commands, but every command I run, such as docker node ls, returns the same error as above. I also tried adding a new node, but joining it to the swarm produces the same error.

I can see the IPs of all the terminated instances when I run docker info on one of the managers, but I can't do anything with them. How can I recover from this state?

 Node Address: 10.80.8.195
 Manager Addresses:
  10.80.7.104:2377
  10.80.7.213:2377
  10.80.7.226:2377
  10.80.7.91:2377
  10.80.8.195:2377
  10.80.8.219:2377

1 Answer


The swarm's cluster state is maintained by the manager nodes, and more than half of them must be online to form a quorum. In your case, you lost the quorum by deleting half of the managers: with 6 managers, at least 4 must be online, and only 3 remain. As a result, the remaining managers cannot elect a new leader, and no manager can control the swarm.
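The quorum arithmetic can be sketched in a few lines of shell. This is only an illustration of the "more than half" rule from the error message; the function names quorum and has_quorum are made up for this example and are not docker commands.

```shell
#!/bin/sh
# Illustration only: quorum and has_quorum are hypothetical helper names.

# quorum TOTAL: smallest number of managers that is more than half of TOTAL
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL ONLINE: succeeds when ONLINE managers form a majority
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

total=6    # managers listed under "Manager Addresses" in docker info
online=3   # managers still running after the accidental termination

if has_quorum "$total" "$online"; then
  echo "quorum held: $online of $total managers online"
else
  echo "quorum lost: $online of $total online, $(quorum "$total") required"
fi
```

With the numbers from the question this prints "quorum lost: 3 of 6 online, 4 required". It also shows why an even number of managers buys nothing: 5 managers and 6 managers both tolerate the loss of only 2.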

In this case, the only way to recover your cluster is to re-initialize it, which forces the creation of a new cluster.

On a manager node, run this command:

docker swarm init --force-new-cluster

As for the other manager nodes, I don't remember whether they join the new cluster automatically or whether you need to leave and join the cluster again.
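A fuller recovery sequence might look like the sketch below. It assumes the surviving manager is 10.80.8.195 (the Node Address from the question) and that the other surviving managers do have to leave and rejoin, which may not hold in every setup. The helper functions and the dry-run switch are hypothetical additions for safety; only the docker subcommands themselves are real CLI calls.

```shell
#!/bin/sh
# Sketch only. Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# recover_swarm ADDR: run on ONE surviving manager only
recover_swarm() {
  # Rebuild a single-manager swarm from this node's existing state
  run docker swarm init --force-new-cluster --advertise-addr "$1"
  # List nodes to find the IDs of the terminated managers
  run docker node ls
  # Print the join command that other managers can use
  run docker swarm join-token manager
}

# remove_dead_manager NODE_ID: clean up one terminated manager
remove_dead_manager() {
  run docker node demote "$1"
  run docker node rm "$1"
}

# rejoin_manager TOKEN ADDR: run on each other surviving manager
rejoin_manager() {
  run docker swarm leave --force
  run docker swarm join --token "$1" "$2:2377"
}

recover_swarm 10.80.8.195
```

The node IDs for remove_dead_manager come from the docker node ls output, and the token for rejoin_manager comes from docker swarm join-token manager. Adding managers back until the count is odd (3 or 5) gives better fault tolerance than an even count.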

jmaitrehenry
    Hi, thanks for the answer. It didn't work on one manager, but it worked on another, and all the managers needed to rejoin the cluster. I always thought reinitializing would create a new cluster with new token IDs, but it seems it does not. – Surya Prakash Patel May 20 '20 at 10:09