We use Docker Swarm in our production environment. Here is the output of `docker node ls`:

ID                            HOSTNAME                         STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
5qpi2zmdonheusou7fgkh9m1g     ip-10-x-241-y.ec2.internal    Ready     Active         Leader           20.10.2
h5nway19ms4po91f0pjzar22b     ip-10-x-241-y.ec2.internal   Ready     Active                          20.10.2
79sikbrre17pf495vijjpydy0 *   ip-10-x-241-y.ec2.internal   Ready     Active         Reachable        20.10.2
u83yq5n5gi7rdkit5i3i6gj6i     ip-10-x-243-y.ec2.internal   Ready     Active                          20.10.2
o87buageysj1vbcefc9xz4wbe     ip-10-x-243-y.ec2.internal   Ready     Active         Reachable        20.10.2

And here is the output of `docker service ls`:

ID             NAME                                  MODE         REPLICAS   IMAGE                                                                 PORTS
m21u7z06tzqw   portainer-app                         replicated   1/1        portainer/portainer:latest                                            *:9002->9000/tcp
jrk2trgqc2r1   aaaaaaaaaaaaaaaaaaaaa                 global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx         *:9200->9200/tcp, *:9300->9300/tcp
3sevi4nv5lnj   bbbbbbbbbbbbbb                        global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                *:5601->5601/tcp
vpij8elkdcqr   cccccccccccccccc                      global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx              *:5000->5000/tcp
etyu98fr7fc4   ddddddddddddddddddddddddddddddddddd   global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
6spidjk8e4dr   eeeeeeeeeeeeeeeeeeeeee                replicated   1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxx
v5h58ms3as3a   fffffffffffffffffffffffffffff         global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxx
qb56lj6bb8k6   gggggggggggggggggggggggggggggggg      global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxx
3wa4fmhtwxsr   hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh      global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxx
2kenua5sdrfa   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii   global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxx
amq6qls538qy   jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj       global       1/1        xxxxxxxxxxxxxxxxxxxxxxxxxxxx
qude01eq2c5j   kkkkkkkkkkkkkkkkkkkkkkkkk             global       2/2        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx             *:443->9000/tcp, *:9000->9000/tcp
uirjzopva1rq   llllllllllllllllllll                  global       2/2        xxxxxxxxxxxx

This configuration worked properly for more than a year. Last weekend, however, the ops team applied security patches and rebooted the worker node machines. Since then, one of the worker nodes, "u83yq5n5gi7rdkit5i3i6gj6i", has not run any containers. I removed the node from the swarm and added it again as a worker, but nothing changed. I also ran a service update, but it only restarted the container on one worker node. Because the services run in global mode, I couldn't scale them to 2 containers (the error says that scaling only works in replicated mode). I expected that, after a worker node is added, swarm would automatically deploy new containers to it, but it didn't.

I believe Docker Swarm logs the reason when it fails to deploy containers on the new worker node, but I couldn't find where that log is.

Since this is a production environment, I can't recreate the swarm from scratch. I need a way to make Docker Swarm deploy the services on the other worker node.

Any ideas?

Ercan Celik
  • You've already checked for any inadvertent changes to firewalls on the patched systems? – J. Scott Elblein Jan 20 '21 at 13:27
  • Have you checked the output of `docker service ps --no-trunc ${service_name}` for those services? Are your EC2 instances in different availability zones? Are you sure you allow the required traffic among all AZs? – Metin Jan 20 '21 at 18:54
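
The `docker service ps` check suggested in the comment above can be wrapped in a few small helpers. This is only a sketch, meant to be run on a manager node: the function names `task_report`, `failed_tasks` and `placement_of`, and the service name `my_service`, are made up for illustration. The `--format` placeholders (`.Node`, `.CurrentState`, `.Error`) and the inspect path `.Spec.TaskTemplate.Placement` are standard Docker CLI fields.

```shell
#!/bin/sh
# Hypothetical helpers; run them on a swarm manager.

# task_report SERVICE
# One line per task: node, current state and the error (if any), separated by '|'.
task_report() {
    docker service ps --no-trunc \
        --format '{{.Node}}|{{.CurrentState}}|{{.Error}}' "$1"
}

# failed_tasks SERVICE
# Keep only the tasks that report an error and print "node: error".
failed_tasks() {
    task_report "$1" | awk -F'|' '$3 != "" { print $1 ": " $3 }'
}

# placement_of SERVICE
# Print the placement section of the service spec (constraints, preferences) as JSON.
placement_of() {
    docker service inspect \
        --format '{{json .Spec.TaskTemplate.Placement}}' "$1"
}

# Usage (my_service is a placeholder):
#   failed_tasks my_service
#   placement_of my_service
```

If `task_report` shows no row at all for the affected node, the scheduler probably never tried to place a task there. For a global service that may point to a placement constraint the node no longer satisfies, for example a node label that was not re-applied after the node was removed and re-added. That is only one possible reading, and comparing `placement_of` with the output of `docker node inspect` for that node would be a way to check it.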

1 Answer

I faced the same error message and solved it by making sure the new node has an Internet connection so that it can download the image. One of the two worker nodes in my cluster couldn't run any application deployed through swarm, and it worked fine once it could pull the image from the Internet. I hope this helps.
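
A quick way to test this is to look up the image a service uses and try to pull it by hand on the affected worker. The sketch below assumes the Docker CLI is available on both machines. `service_image`, `can_pull` and `my_service` are illustrative names, not anything from the original setup.

```shell
#!/bin/sh
# Hypothetical helpers for checking whether a node can pull a service's image.

# service_image SERVICE
# Run on a manager: print the image reference stored in the service spec.
service_image() {
    docker service inspect \
        --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' "$1"
}

# can_pull IMAGE
# Run on the affected worker: report whether the image can be pulled.
can_pull() {
    if docker pull "$1" >/dev/null 2>&1; then
        echo "pull ok: $1"
    else
        echo "pull FAILED: $1"
        return 1
    fi
}

# Usage:
#   on a manager:  service_image my_service
#   on the worker: can_pull <image printed by the manager>
```

If the pull fails only on that node, outbound connectivity or registry credentials are likely suspects. For images in a private registry, running `docker service update --with-registry-auth` from a manager forwards the registry credentials to the workers. Whether a private registry is involved here is an assumption.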