We have several Docker Swarm clusters in production, each with 4-10 nodes. Following the recommendation for maintaining Raft quorum, each cluster has 3-5 manager
nodes. The clusters run on EC2 instances in different AWS regions.
One of our clusters became unstable after a recent outage in one of the AWS AZs, and since then I have been running this production cluster with just one manager,
which is not recommended. When I promote another node, the promotion succeeds, but within a few seconds I lose the cluster and see the message: "The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online."
The only way to recover the cluster is to reinitialize the swarm on the manager node, e.g. with docker swarm init --force-new-cluster.
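For context, the "more than half of the managers" rule in that message is plain majority arithmetic. The following is a minimal sketch, assuming the standard Raft majority rule; the function names quorum and tolerated_failures are illustrative and not part of the Docker CLI.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not Docker commands).

# Majority needed for a Raft group of N managers: floor(N/2) + 1.
quorum() {
  local managers=$1
  echo $(( managers / 2 + 1 ))
}

# Managers that can be lost while a majority still remains: floor((N-1)/2).
tolerated_failures() {
  local managers=$1
  echo $(( (managers - 1) / 2 ))
}

# Print the numbers for a few manager counts.
for n in 1 2 3 5; do
  echo "managers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With two managers the quorum is also two, so if a newly promoted manager is counted as a member but never actually joins the Raft group, the one healthy manager no longer holds a majority. That would be consistent with the loss of leadership described above, although it does not explain why the new manager fails to join.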
Although the cluster continues to work after this, the node list is inconsistent. The node I attempted to add remains down,
and I have to force it to leave the swarm and then remove it from the cluster. Attempting to remove this node from the cluster results in the message "the node you are attempting to remove is a manager". docker node inspect <nodeid> shows its Role as manager, but any cluster-level operation on the node fails, stating that the node is not a manager.
I can't afford to lose this cluster and recreate it, as there are live customers who expect high uptime. Leaving the setup as it is would also be risky: I don't have enough manager
nodes, and every swarm-related operation leaves the cluster in an inconsistent state.
The dockerd logs in /var/log/syslog are not very helpful in identifying the root cause. The following is the log snippet I see when I promote another node to manager:
dockerd[24061]: time="2021-03-03T05:43:12.576569707Z" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher node.id=t2x1wbrgq616fsx6v7ay2euqn
dockerd[24061]: time="2021-03-03T05:43:12.576746632Z" level=info msg="dispatcher session dropped, marking node ig3w333ffmkdbff9hz1g58d0f down" method="(*Dispatcher).Session" node.id=ig3w333ffmkdbff9hz1g58d0f node.session=c19degio4wtlcfm2xwkjsr7oq
dockerd[24061]: time="2021-03-03T05:43:12.576772083Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=ig3w333ffmkdbff9hz1g58d0f node.session=c19degio4wtlcfm2xwkjsr7oq
dockerd[24061]: time="2021-03-03T05:43:12.576858405Z" level=info msg="dispatcher session dropped, marking node 3zgtdu6meh3ixf8h7vtwpo2ic down" method="(*Dispatcher).Session" node.id=3zgtdu6meh3ixf8h7vtwpo2ic node.session=1yqzxseeumz6m8cqy4bfpqq9f
dockerd[24061]: time="2021-03-03T05:43:12.576878476Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=3zgtdu6meh3ixf8h7vtwpo2ic node.session=1yqzxseeumz6m8cqy4bfpqq9f
dockerd[24061]: time="2021-03-03T05:43:12.576986159Z" level=info msg="dispatcher session dropped, marking node bk5mapetwtri0gax5o8zc29k2 down" method="(*Dispatcher).Session" node.id=bk5mapetwtri0gax5o8zc29k2 node.session=9ae2u0v2dlyyo8do3tv7e4b8m
dockerd[24061]: time="2021-03-03T05:43:12.577008040Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=bk5mapetwtri0gax5o8zc29k2 node.session=9ae2u0v2dlyyo8do3tv7e4b8m
dockerd[24061]: time="2021-03-03T05:43:12.577097253Z" level=info msg="dispatcher session dropped, marking node znbimzdsduwzzfggl1uno3n9c down" method="(*Dispatcher).Session" node.id=znbimzdsduwzzfggl1uno3n9c node.session=xz5xvsnc1tr0eb0p5ytd8eu5s
dockerd[24061]: time="2021-03-03T05:43:12.577127094Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=znbimzdsduwzzfggl1uno3n9c node.session=xz5xvsnc1tr0eb0p5ytd8eu5s
dockerd[24061]: time="2021-03-03T05:43:12.577268958Z" level=info msg="dispatcher session dropped, marking node c6fnnzlrcb59ob0mcpwaf52f7 down" method="(*Dispatcher).Session" node.id=c6fnnzlrcb59ob0mcpwaf52f7 node.session=mt8sd2646xgk9lw549ydri43a
dockerd[24061]: time="2021-03-03T05:43:12.577326000Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=c6fnnzlrcb59ob0mcpwaf52f7 node.session=mt8sd2646xgk9lw549ydri43a
dockerd[24061]: time="2021-03-03T05:43:12.577396992Z" level=info msg="dispatcher session dropped, marking node p6r18tf30q6boy5woygvnfnrv down" method="(*Dispatcher).Session" node.id=p6r18tf30q6boy5woygvnfnrv node.session=680gvutmvvyf14sy6yjc3sf00
dockerd[24061]: time="2021-03-03T05:43:12.577426743Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=p6r18tf30q6boy5woygvnfnrv node.session=680gvutmvvyf14sy6yjc3sf00
dockerd[24061]: time="2021-03-03T05:43:12.577309389Z" level=info msg="dispatcher session dropped, marking node t2x1wbrgq616fsx6v7ay2euqn down" method="(*Dispatcher).Session" node.id=t2x1wbrgq616fsx6v7ay2euqn node.session=dii2khkdxlqvs2ckao7hhin0u
dockerd[24061]: time="2021-03-03T05:43:12.577502235Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=t2x1wbrgq616fsx6v7ay2euqn node.session=dii2khkdxlqvs2ckao7hhin0u
dockerd[24061]: time="2021-03-03T05:43:12.578248017Z" level=info msg="leadership changed from not yet part of a raft cluster to no cluster leader" module=node node.id=t2x1wbrgq616fsx6v7ay2euqn
dockerd[24061]: time="2021-03-03T05:43:12.578520436Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Aborted desc = node must disconnect" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
dockerd[24061]: time="2021-03-03T05:43:12.578593468Z" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
dockerd[24061]: time="2021-03-03T05:43:12.578634949Z" level=info msg="waiting 77.462294ms before registering session" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
dockerd[24061]: time="2021-03-03T05:43:17.656467883Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=t2x1wbrgq616fsx6v7ay2euqn
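To make a snippet like this easier to scan, the IDs of the nodes that the dispatcher marked down can be pulled out with a small filter. This is only a sketch that assumes the "marking node <id> down" message format shown in the log above; down_nodes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read dockerd log lines on stdin and print the unique
# node IDs that appear in "marking node <id> down" messages.
down_nodes() {
  grep -o 'marking node [a-z0-9]* down' | awk '{print $3}' | sort -u
}
```

For example, down_nodes < /var/log/syslog would list each affected node ID once, which can then be compared against the output of docker node ls.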
This is a 7-node cluster with 3 nodes running v20.10.5 and 4 nodes running v19.03.13 on Ubuntu 18.04 LTS. Any pointers for troubleshooting this and recovering the cluster are appreciated.