8

It seems my server ran out of space, and I was having problems with some of the deployed Docker stacks. It took me a while to figure it out, but eventually I did and removed a couple of containers and images to free some space.

I was able to run `service docker restart` and it worked. However, there are still some problems:

  • `docker info` says the swarm is "Pending"
  • `docker node ls` shows the only node I have (the leader); its availability is Active but its status is Down
  • `journalctl -f -u docker` shows `level=error msg="error removing task " error="incompatible value module=node/agent/worker node.id="`

When running `docker service ls`, all services show 0/1 replicas.

This is the status reported by docker node inspect for that node:

"Status": {
    "State": "down",
    "Message": "heartbeat failure for node in \"unknown\" state",
    "Addr": "<ip and port>"
},
"ManagerStatus": {
    "Leader": true,
    "Reachability": "reachable",
    "Addr": "<ip and port>"
}

How can I get my services running again?

Christopher Francisco

4 Answers

11

Sometimes when you restart Docker or update your Docker version, the tasks.db file gets corrupted.

This is an open issue (#34827). Some people have suggested a workaround: move the tasks.db file out of the way and check whether that fixes the problem; if it does, delete the tasks.db file. Docker will automatically create a new one for you.

You can find the tasks.db file in /var/lib/docker/swarm/worker/
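
A minimal sketch of that workaround, assuming the default Docker data root /var/lib/docker (adjust the path if yours differs):

# stop Docker so tasks.db is not in use
systemctl stop docker
# move tasks.db aside instead of deleting it outright, so it can be restored if needed
mv /var/lib/docker/swarm/worker/tasks.db /var/lib/docker/swarm/worker/tasks.db.bak
# start Docker again; a new tasks.db is created automatically
systemctl start docker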

I faced the same issue recently and this workaround saved my day. I didn't lose any data related to my stacks.

Update (October 19, 2020)

Issue #34827 is now closed, but the solution is still the same: remove the tasks.db file.

Yor Jaggy
4

Option 1:

Wait. Sometimes it fixes itself.

Option 2 (May vary depending on OS):

systemctl stop docker
# WARNING: this deletes all swarm state on the node (service and stack definitions, secrets, configs)
rm -Rf /var/lib/docker/swarm
systemctl start docker
# re-initialize a fresh single-node swarm
docker swarm init
Javier Yáñez
  • It would be worth noting that doing this without backups means you lose all the definitions within the swarm. – Dockstar Nov 06 '19 at 01:53
  • Yes, it is not a solution for a production swarm, but for a single development node it's valid. – Javier Yáñez Nov 06 '15 at 15:00
  • Instead of wiping the entire swarm folder (which will essentially wipe your swarm), only delete /var/lib/docker/swarm/worker/tasks.db. If you do that, the last "docker swarm init" is not necessary – Jeramy Rutley Jun 10 '20 at 17:26
0

I found the following solution: https://forums.docker.com/t/docker-worker-nodes-shown-as-down-after-re-start/22329

After the Docker service was restarted, the leader node was down.

I fixed this by promoting the worker node to a manager node and then, on the new manager node, demoting the failed leader node.

ubuntu@staging1:~$ docker node ls
ID                           HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *  staging1   Down     Active         Reachable
x68yyqtt0rogmabec552634mf    staging2   Ready    Active

ubuntu@staging1:~$ docker node promote staging2

root@staging1:~# docker node ls
ID                           HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
plxkuqqnkxotrzy7nhjj27w34 *  staging1   Down     Active         Leader
x68yyqtt0rogmabec552634mf    staging2   Ready    Active         Reachable

root@staging2:~# docker node demote staging1

root@staging2:~# docker node ls
ID                           HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
plxkuqqnkxotrzy7nhjj27w34    staging1   Down     Active
x68yyqtt0rogmabec552634mf *  staging2   Ready    Active         Leader

root@staging2:~# docker node rm staging1

Get the manager join token from the leader node:
root@staging2:~# docker swarm join-token manager

Reconnect the failed node to the swarm cluster:

root@staging1:~# docker swarm leave --force
root@staging1:~# systemctl stop docker
root@staging1:~# rm -rf /var/lib/docker/swarm/
root@staging1:~# systemctl start docker
root@staging1:~# docker swarm join --token XXXXXXXX 192.168.XX.XX:2377

root@staging1:~# docker node ls
ID                           HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *  staging1   Ready    Active         Reachable
x68yyqtt0rogmabec552634mf    staging2   Ready    Active         Leader

root@staging1:~# docker node demote staging2

root@staging1:~# docker node ls
ID                           HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *  staging1   Ready    Active         Leader
x68yyqtt0rogmabec552634mf    staging2   Ready    Active
Ryabchenko Alexander
-2

First, check the details of the node:

docker node ls

If the status of the node still shows Down while its availability is Active, the services running on that node may have stopped. Create the service in global mode, or force an update of the service running in the swarm with the following command (see the sketch below for applying it to every service):

docker service update --force <service-name>
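
A minimal sketch of forcing an update on every service in the swarm (the service IDs are read dynamically, so nothing here assumes a particular stack):

# force a redeploy of every service in the swarm
for svc in $(docker service ls -q); do
    docker service update --force "$svc"
done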