Recovering a docker swarm cluster

Question

We run several Docker Swarm clusters in production, each with 4-10 nodes, of which 3-5 are managers, as recommended for maintaining Raft quorum. They are hosted on EC2 instances in different AWS regions.
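
For reference, the quorum arithmetic behind that recommendation can be sketched as below. This is only an illustration of the Raft majority rule, not Docker's implementation, and the function names are made up for this example.

```python
def raft_quorum(managers: int) -> int:
    """Smallest number of managers that forms a strict majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def fault_tolerance(managers: int) -> int:
    """How many managers can be lost while a majority still remains."""
    return managers - raft_quorum(managers)


if __name__ == "__main__":
    # Print the majority rule for small manager counts.
    for n in range(1, 8):
        print(f"managers={n} quorum={raft_quorum(n)} "
              f"tolerated_failures={fault_tolerance(n)}")
```

Note that going from 1 to 2 managers raises the quorum from 1 to 2 while the number of tolerated failures stays at 0, so a single promoted manager that does not come up could by itself leave the swarm without a majority.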

Our cluster became unstable after a recent outage in one of the AWS AZs, and since then I have been running this production cluster with just one manager, which is not recommended. Promoting another node appears to succeed, but within a few seconds I lose the cluster and see the message "The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." The only way to recover is to reinitialize the swarm on the manager node, e.g. docker swarm init --force-new-cluster.

Although the cluster continues to work after this, the node list is inconsistent. The node I attempted to add remains down, and I have to force it to leave the swarm and then remove it from the manager's node list. Attempting to remove this node from the cluster results in the message "the node you are attempting to remove is a manager". docker node inspect <nodeid> shows its Role as manager, but any cluster-level operation on the node fails, stating that the node is not a manager.
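
One way to make the inconsistency visible is to group the managers by their reported status. The sketch below assumes the input is the output of docker node ls --format '{{json .}}' (one JSON object per line, with ID and ManagerStatus keys); the sample records, node IDs and hostnames are hypothetical placeholders, not data from this cluster.

```python
import json


def summarize_managers(node_ls_json_lines):
    """Group manager node IDs by reachability, based on ManagerStatus."""
    summary = {"reachable": [], "unreachable": []}
    for line in node_ls_json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        node = json.loads(line)
        status = node.get("ManagerStatus", "")
        if not status:
            continue  # workers report an empty ManagerStatus
        key = "reachable" if status in ("Leader", "Reachable") else "unreachable"
        summary[key].append(node["ID"])
    return summary


def has_majority(summary):
    """True when the reachable managers are more than half of all managers."""
    total = len(summary["reachable"]) + len(summary["unreachable"])
    return total > 0 and len(summary["reachable"]) * 2 > total


# Hypothetical sample: one leader, one unreachable manager, one worker.
SAMPLE = "\n".join(json.dumps(n) for n in [
    {"ID": "node-a", "Hostname": "host-a", "ManagerStatus": "Leader"},
    {"ID": "node-b", "Hostname": "host-b", "ManagerStatus": "Unreachable"},
    {"ID": "node-c", "Hostname": "host-c", "ManagerStatus": ""},
])

if __name__ == "__main__":
    result = summarize_managers(SAMPLE)
    print(result, "majority:", has_majority(result))
```

In the hypothetical sample, one of two managers is reachable, which is not a majority; that is the kind of state in which the "does not have a leader" message would be expected.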

I can't afford to lose this cluster and recreate it, as there are live customers who expect high uptime. Leaving the setup as it is also carries risk: I don't have enough manager nodes, and any swarm-related operation leaves the cluster in an inconsistent state.

The dockerd logs in /var/log/syslog are not very helpful in identifying the root cause. The following is the log snippet I see when I promote another node to manager:

 dockerd[24061]: time="2021-03-03T05:43:12.576569707Z" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher node.id=t2x1wbrgq616fsx6v7ay2euqn
 dockerd[24061]: time="2021-03-03T05:43:12.576746632Z" level=info msg="dispatcher session dropped, marking node ig3w333ffmkdbff9hz1g58d0f down" method="(*Dispatcher).Session" node.id=ig3w333ffmkdbff9hz1g58d0f node.session=c19degio4wtlcfm2xwkjsr7oq
 dockerd[24061]: time="2021-03-03T05:43:12.576772083Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=ig3w333ffmkdbff9hz1g58d0f node.session=c19degio4wtlcfm2xwkjsr7oq
 dockerd[24061]: time="2021-03-03T05:43:12.576858405Z" level=info msg="dispatcher session dropped, marking node 3zgtdu6meh3ixf8h7vtwpo2ic down" method="(*Dispatcher).Session" node.id=3zgtdu6meh3ixf8h7vtwpo2ic node.session=1yqzxseeumz6m8cqy4bfpqq9f
 dockerd[24061]: time="2021-03-03T05:43:12.576878476Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=3zgtdu6meh3ixf8h7vtwpo2ic node.session=1yqzxseeumz6m8cqy4bfpqq9f
 dockerd[24061]: time="2021-03-03T05:43:12.576986159Z" level=info msg="dispatcher session dropped, marking node bk5mapetwtri0gax5o8zc29k2 down" method="(*Dispatcher).Session" node.id=bk5mapetwtri0gax5o8zc29k2 node.session=9ae2u0v2dlyyo8do3tv7e4b8m
 dockerd[24061]: time="2021-03-03T05:43:12.577008040Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=bk5mapetwtri0gax5o8zc29k2 node.session=9ae2u0v2dlyyo8do3tv7e4b8m
 dockerd[24061]: time="2021-03-03T05:43:12.577097253Z" level=info msg="dispatcher session dropped, marking node znbimzdsduwzzfggl1uno3n9c down" method="(*Dispatcher).Session" node.id=znbimzdsduwzzfggl1uno3n9c node.session=xz5xvsnc1tr0eb0p5ytd8eu5s
 dockerd[24061]: time="2021-03-03T05:43:12.577127094Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=znbimzdsduwzzfggl1uno3n9c node.session=xz5xvsnc1tr0eb0p5ytd8eu5s
 dockerd[24061]: time="2021-03-03T05:43:12.577268958Z" level=info msg="dispatcher session dropped, marking node c6fnnzlrcb59ob0mcpwaf52f7 down" method="(*Dispatcher).Session" node.id=c6fnnzlrcb59ob0mcpwaf52f7 node.session=mt8sd2646xgk9lw549ydri43a
 dockerd[24061]: time="2021-03-03T05:43:12.577326000Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=c6fnnzlrcb59ob0mcpwaf52f7 node.session=mt8sd2646xgk9lw549ydri43a
 dockerd[24061]: time="2021-03-03T05:43:12.577396992Z" level=info msg="dispatcher session dropped, marking node p6r18tf30q6boy5woygvnfnrv down" method="(*Dispatcher).Session" node.id=p6r18tf30q6boy5woygvnfnrv node.session=680gvutmvvyf14sy6yjc3sf00
 dockerd[24061]: time="2021-03-03T05:43:12.577426743Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=p6r18tf30q6boy5woygvnfnrv node.session=680gvutmvvyf14sy6yjc3sf00
 dockerd[24061]: time="2021-03-03T05:43:12.577309389Z" level=info msg="dispatcher session dropped, marking node t2x1wbrgq616fsx6v7ay2euqn down" method="(*Dispatcher).Session" node.id=t2x1wbrgq616fsx6v7ay2euqn node.session=dii2khkdxlqvs2ckao7hhin0u
 dockerd[24061]: time="2021-03-03T05:43:12.577502235Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=t2x1wbrgq616fsx6v7ay2euqn node.session=dii2khkdxlqvs2ckao7hhin0u
 dockerd[24061]: time="2021-03-03T05:43:12.578248017Z" level=info msg="leadership changed from not yet part of a raft cluster to no cluster leader" module=node node.id=t2x1wbrgq616fsx6v7ay2euqn
 dockerd[24061]: time="2021-03-03T05:43:12.578520436Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Aborted desc = node must disconnect" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
 dockerd[24061]: time="2021-03-03T05:43:12.578593468Z" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
 dockerd[24061]: time="2021-03-03T05:43:12.578634949Z" level=info msg="waiting 77.462294ms before registering session" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
 dockerd[24061]: time="2021-03-03T05:43:17.656467883Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
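
To condense a log like the one above, the node IDs in the "marking node ... down" messages can be extracted with a small script. This is a rough sketch that assumes the message wording stays as shown; the function name is made up, and the sample contains three lines copied from the snippet.

```python
import re

# Matches the dispatcher message that reports a dropped session.
MARKED_DOWN = re.compile(r"marking node (\w+) down")


def nodes_marked_down(log_text):
    """Return node IDs from 'marking node <id> down' messages, in first-seen order."""
    seen = []
    for match in MARKED_DOWN.finditer(log_text):
        node_id = match.group(1)
        if node_id not in seen:
            seen.append(node_id)
    return seen


# Three lines copied from the log snippet (two 'marking' lines, one error line).
SAMPLE = '''
dockerd[24061]: time="2021-03-03T05:43:12.576746632Z" level=info msg="dispatcher session dropped, marking node ig3w333ffmkdbff9hz1g58d0f down" method="(*Dispatcher).Session" node.id=ig3w333ffmkdbff9hz1g58d0f node.session=c19degio4wtlcfm2xwkjsr7oq
dockerd[24061]: time="2021-03-03T05:43:12.576772083Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=ig3w333ffmkdbff9hz1g58d0f node.session=c19degio4wtlcfm2xwkjsr7oq
dockerd[24061]: time="2021-03-03T05:43:12.576858405Z" level=info msg="dispatcher session dropped, marking node 3zgtdu6meh3ixf8h7vtwpo2ic down" method="(*Dispatcher).Session" node.id=3zgtdu6meh3ixf8h7vtwpo2ic node.session=1yqzxseeumz6m8cqy4bfpqq9f
'''

if __name__ == "__main__":
    for node_id in nodes_marked_down(SAMPLE):
        print(node_id)
```

In the full snippet, the same message appears for seven distinct node IDs, including t2x1wbrgq616fsx6v7ay2euqn, the node whose dispatcher is stopping. That suggests every session was dropped at once when the dispatcher stopped, rather than individual nodes failing.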

This is a 7-node cluster with 3 nodes running v20.10.5 and 4 nodes running v19.03.13 on Ubuntu 18.04 LTS. Any pointers for troubleshooting this and recovering the cluster are appreciated.
