
Swarm Gurus,

I have just set up my very first Docker Swarm environment with 3 hosts, following the documentation here:

https://docs.docker.com/engine/install/ubuntu/
https://docs.docker.com/engine/swarm/swarm-tutorial/
https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/
https://docs.docker.com/engine/swarm/swarm-tutorial/deploy-service/
https://docs.docker.com/engine/swarm/swarm-tutorial/scale-service/

I was able to create a service with 5 replicas, and it worked as expected: the containers were spread across the 3 nodes (one manager and two workers).
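
For reference, I created the service roughly as in the deploy-service and scale-service tutorials:

$ docker service create --replicas 1 --name helloworld alpine ping docker.com
$ docker service scale helloworld=5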

Then I started to experiment by shutting down all 3 nodes and starting them back up. The service I had created (named helloworld) was automatically respawned by Docker, and the swarm was restored.

But I noticed one thing: the original containers were no longer there. Instead, I got this:

someuser@manager:~$ docker service ps helloworld --no-trunc
ID                          NAME               IMAGE                                                                                   NODE      DESIRED STATE   CURRENT STATE            ERROR                                                         PORTS
8vlswsfq8ub5xn9vd401ilskn   helloworld.1       alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   manager   Running         Running 30 minutes ago
jqfgg41xppf7xcchnkvjyesyx    \_ helloworld.1   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   manager   Shutdown        Failed 30 minutes ago    "No such container: helloworld.1.jqfgg41xppf7xcchnkvjyesyx"
wy382jy2yncpv6b3y1y0qfq3h   helloworld.2       alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   manager   Running         Running 30 minutes ago
mq7w469vck8hzr7p9w22f0rt1    \_ helloworld.2   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   manager   Shutdown        Failed 30 minutes ago    "No such container: helloworld.2.mq7w469vck8hzr7p9w22f0rt1"
jp5wbvbdxxgh60vzef9iz73aj   helloworld.3       alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker01   Running         Running 30 minutes ago
t5wgad0dhu5hoyp3kjrdela4b    \_ helloworld.3   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker01   Shutdown        Failed 30 minutes ago    "No such container: helloworld.3.t5wgad0dhu5hoyp3kjrdela4b"
km03jrxlvam162i8pt2ix6vlf   helloworld.4       alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker02   Running         Running 29 minutes ago
8hjnbjz4nmpqncmva4ubeqpx6    \_ helloworld.4   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker02   Shutdown        Failed 30 minutes ago    "No such container: helloworld.4.8hjnbjz4nmpqncmva4ubeqpx6"
knbvl6el13l0poofdv1g6j11z   helloworld.5       alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker02   Running         Running 29 minutes ago
thlnyngdbwwsi30fuxx4wx7cd    \_ helloworld.5   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker02   Shutdown        Failed 30 minutes ago    "No such container: helloworld.5.thlnyngdbwwsi30fuxx4wx7cd"

I am totally fine with the new containers, since I did not shut down the nodes gracefully, and an ungraceful shutdown is part of the test case.

But I want to get rid of the failed task entries, which are the following:

jqfgg41xppf7xcchnkvjyesyx    \_ helloworld.1   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   manager   Shutdown        Failed 30 minutes ago    "No such container: helloworld.1.jqfgg41xppf7xcchnkvjyesyx"
mq7w469vck8hzr7p9w22f0rt1    \_ helloworld.2   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   manager   Shutdown        Failed 30 minutes ago    "No such container: helloworld.2.mq7w469vck8hzr7p9w22f0rt1"
t5wgad0dhu5hoyp3kjrdela4b    \_ helloworld.3   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker01   Shutdown        Failed 30 minutes ago    "No such container: helloworld.3.t5wgad0dhu5hoyp3kjrdela4b"
8hjnbjz4nmpqncmva4ubeqpx6    \_ helloworld.4   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker02   Shutdown        Failed 30 minutes ago    "No such container: helloworld.4.8hjnbjz4nmpqncmva4ubeqpx6"
thlnyngdbwwsi30fuxx4wx7cd    \_ helloworld.5   alpine:latest@sha256:21a3deaa0d32a8057914f36584b5288d2e5ecc984380bc0118285c70fa8c9300   worker02   Shutdown        Failed 30 minutes ago    "No such container: helloworld.5.thlnyngdbwwsi30fuxx4wx7cd"

I tried the following:

$ docker rm \_ helloworld.1
$ docker rm \helloworld.1.jqfgg41xppf7xcchnkvjyesyx
$ docker rm --link \_ helloworld.1
$ docker rm --link \helloworld.1.jqfgg41xppf7xcchnkvjyesyx

But none of these worked.

Your advice is much appreciated.

Thanks

Artanis Zeratul

1 Answer


docker service ps lists all the tasks associated with a service, and tasks can be in a variety of states: started, running, complete, etc.

Running tasks are associated with a container.

The utility of tracking tasks independently is that, from the docker service ps list, you can use the task ID rather than the service ID in some docker commands, such as docker service logs <task id>, to find out specifically why a particular task failed.
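
For example, using one of the failed task IDs from the question (this assumes the task's log output is still retrievable, which it won't be if the underlying container is gone):

$ docker service logs jqfgg41xppf7xcchnkvjyesyx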

You can also run docker inspect <task id>, which returns a block of data that may indicate why a task could not be started at all. If the task did start, the output includes the ID of the container that actually ran it, which you can use on the node in question to look for things like OOM errors or in-container logs.
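
A minimal sketch, run on a manager node (the .Status.ContainerStatus.ContainerID field path is an assumption about the task object's layout; check the full docker inspect output if it differs):

$ docker inspect jqfgg41xppf7xcchnkvjyesyx
$ docker inspect --format '{{.Status.ContainerStatus.ContainerID}}' jqfgg41xppf7xcchnkvjyesyx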

You can clean up the containers associated with finished tasks, but docker automatically retains task history up to the task history retention limit (set with docker swarm update --task-history-limit; the default is 5). Setting this value smaller keeps the history smaller, but you still can't (and really would not want to) clear it entirely.
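
To shrink the history on this swarm, something like the following should work (run on a manager node; a smaller limit takes effect as tasks rotate, so the existing Shutdown entries may not disappear immediately — that timing is my assumption, so verify with docker service ps afterwards):

$ docker swarm update --task-history-limit 1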

Chris Becke
  • it says "No such container: helloworld.1.jqfgg41xppf7xcchnkvjyesyx" so there is no way I will be able to fetch its logs or inspect it. – Artanis Zeratul Jan 13 '22 at 20:16
  • "no such container" is because you did a system prune which removes stopped containers, wether or not they are associated with tasks.Once youve pruned the containers the logs are, indeed, gone. – Chris Becke Jan 14 '22 at 06:37
  • I didn't do a system prune. I just shut down the host machines forcefully, simulating a power failure or hardware failure. – Artanis Zeratul Feb 13 '22 at 05:51