On a Docker Swarm running on Amazon Web Services, a globally deployed service has gone missing from a single node of the swarm several times, and docker node update --force
does not bring it back on all nodes. Here is the situation according to Docker:
ubuntu@swarm-manager-a:~$ docker service ls
ID NAME MODE REPLICAS IMAGE
vayezjxwifsd logging_papertrail global 10/10 gliderlabs/logspout:latest
7w1e9zhsa9kh monitoring_dd-agent global 9/9 mycompany/myddagent
...
So for some reason, the swarm thinks that monitoring_dd-agent should only run on 9 of the 10 hosts, whereas all other global services run on all 10 hosts.
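To spot this kind of mismatch without reading the listing by eye, a small filter over the docker service ls output can help. This is only a sketch: under_replicated is a hypothetical helper, not a Docker command, and it assumes the default column layout shown above (ID, NAME, MODE, REPLICAS, IMAGE) and that the expected node count is passed in by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of Docker): reads `docker service ls` output on
# stdin and prints global services whose desired count is below the node count.
# $1 = expected number of nodes in the swarm (10 in the listing above).
under_replicated() {
  awk -v nodes="$1" '
    NR > 1 && $3 == "global" {
      split($4, r, "/")              # REPLICAS column is "running/desired"
      if (r[2] + 0 < nodes + 0)      # desired count lower than node count
        print $2, $4
    }'
}
```

It would be used as docker service ls | under_replicated 10, where 10 is the node count reported by docker system info.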
Querying the status of the individual tasks of monitoring_dd-agent suggests that the missing instance shut down by itself on node swarm-worker-d:
ubuntu@swarm-manager-a:~$ docker service ps monitoring_dd-agent
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE
xlm3kalqevnr monitoring_dd-agent.4z3yz6y5 mycompany/myddagent swarm-worker-f Running Running 2 days ago
lyqw42dy8rsv monitoring_dd-agent.rguyjlhg mycompany/myddagent swarm-worker-d Shutdown Complete 4 hours ago
on5zmi18tcal monitoring_dd-agent.zcx9jo66 mycompany/myddagent swarm-manager-b Running Running 2 days ago
...
The nodes are identical, except that the workers have 32 GiB of memory and the managers have 16 GiB:
ubuntu@swarm-manager-a:~$ docker system info
Server Version: 18.03.1-ce
Swarm: active
Managers: 3
Nodes: 10
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Kernel Version: 4.4.0-1079-aws
Operating System: Ubuntu 16.04.5 LTS
CPUs: 8
Total Memory: 15.11GiB
...
Does anyone have pointers to explanations for this weirdness?