
On a Docker Swarm running on Amazon Web Services, a globally deployed service has several times gone missing on a single node of the swarm, and docker node update --force does not bring it back on all nodes. Here is the situation according to Docker:

ubuntu@swarm-manager-a:~$ docker service ls
ID            NAME                 MODE    REPLICAS  IMAGE
vayezjxwifsd  logging_papertrail   global  10/10     gliderlabs/logspout:latest
7w1e9zhsa9kh  monitoring_dd-agent  global  9/9       mycompany/myddagent
...

So for some reason, the swarm thinks that monitoring_dd-agent should only run on 9 of the 10 hosts, whereas all other global services run on all 10 hosts.

Querying the status of the individual deployments of monitoring_dd-agent, it seems that the missing instance has shut down by itself on node swarm-worker-d:

ubuntu@swarm-manager-a:~$ docker service ps monitoring_dd-agent 
ID            NAME                          IMAGE                 NODE              DESIRED STATE  CURRENT STATE
xlm3kalqevnr  monitoring_dd-agent.4z3yz6y5  mycompany/myddagent   swarm-worker-f    Running        Running 2 days ago
lyqw42dy8rsv  monitoring_dd-agent.rguyjlhg  mycompany/myddagent   swarm-worker-d    Shutdown       Complete 4 hours ago
on5zmi18tcal  monitoring_dd-agent.zcx9jo66  mycompany/myddagent   swarm-manager-b   Running        Running 2 days ago
...
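
With ten nodes, spotting the one without a running task in that listing is tedious. A minimal sketch for doing it mechanically follows; the function names missing_nodes and nodes_without_task are made up for illustration, and it assumes it is run on a manager node with a Docker CLI that supports --format on docker node ls:

```shell
# Print every line of file $1 that does not appear in file $2.
# Both files hold one hostname per line.
missing_nodes() {
    # -v invert match, -x whole lines only, -F fixed strings, -f patterns from file
    grep -vxFf "$2" "$1"
}

# Print the swarm nodes that have no task of service $1 whose desired
# state is "running". Hypothetical helper; must run on a manager node.
nodes_without_task() {
    all=$(mktemp)
    running=$(mktemp)
    docker node ls --format '{{.Hostname}}' > "$all"
    docker service ps --filter 'desired-state=running' \
        --format '{{.Node}}' "$1" > "$running"
    missing_nodes "$all" "$running"
    rm -f "$all" "$running"
}
```

Calling nodes_without_task monitoring_dd-agent should then print only the hostnames where the swarm no longer wants the service to run.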

The nodes are identical, except that the workers have 32GiB of memory while the managers have 16GiB:

ubuntu@swarm-manager-a:~$ docker system info
Server Version: 18.03.1-ce
Swarm: active
 Managers: 3
 Nodes: 10
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Kernel Version: 4.4.0-1079-aws
Operating System: Ubuntu 16.04.5 LTS
CPUs: 8
Total Memory: 15.11GiB
...

Does anyone have pointers to explanations for this weirdness?

jpsecher