
I've been deploying stacks to swarms with the start-first option for quite a while now.

So given the following api.yml file:

version: '3.4'

services:

  api:
    image: registry.gitlab.com/myproj/api:${VERSION}
    deploy:
      update_config:
        order: start-first

I would run the following command against a swarm manager:

env VERSION=x.y.z docker stack deploy -f api.yml api

This worked fine - the old service kept serving requests until the new one was fully available. Only then would it be torn down and enter shutdown state.
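For what it's worth, this is roughly how I verify the rollover (api_api being the stack-qualified service name here):

docker service ps api_api                                 # old tasks show desired state Shutdown once the new one is Running
docker service ps --filter desired-state=running api_api  # only the tasks that should currently be serving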

Recently - I believe this started with Docker v17.12.0-ce or v18.01.0-ce, though I may just not have noticed before - the old service sometimes isn't stopped correctly.

When that happens it hangs around and keeps serving requests, resulting in us running a mix of old and new versions side by side indefinitely.

This happens both on swarms that run the service replicated and on one that runs it with scale=1.

What's worse, I cannot even kill the old containers. Here's what I've tried:

  • docker service rm api_api
  • docker stack rm api && docker stack deploy -f api.yml api
  • docker rm -f <container id>

Nothing allows me to get rid of the 'zombie' container. In fact docker rm -f <container id> even locks up and simply sits there.

The only way I've found to get rid of them is to restart the node. Thanks to replication I can afford to do that without downtime, but it's not great for various reasons, not least of which is what might happen if another manager were to go down while I'm doing it.
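For completeness, the restart I fall back to is roughly the usual drain/reboot/activate sequence:

docker node update --availability drain <node>    # move the replicated tasks off the node first
# ...reboot the node...
docker node update --availability active <node>   # let it accept tasks again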

Has anyone else seen this behaviour? What might be the cause and how could I debug this?

sas
  • I am experiencing the same problem. It doesn't happen every time. I didn't find any fix. – Petr Jul 10 '19 at 11:26
  • I also experienced the same problem today. It happened after my RAM usage was full and my instance stopped responding. I used docker kill to kill the previous container. – Aniket Singla Nov 07 '20 at 18:29
  • For "stuck" processes I would check if they're in uninterruptible I/O and then track down the other "side" of that I/O connection to see why. Can you find the process on the node? – joebeeson Dec 14 '20 at 21:50
  • Happens for me too. At the moment I do `docker service update --force ` to get the correct number of replicas. – Sheepwall Feb 28 '22 at 15:48

1 Answer


Try setting max_replicas_per_node (1 if you only need one replica per node) in the placement section under deploy.

Refer to https://docs.docker.com/compose/compose-file/compose-file-v3/
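Applied to the api.yml from the question, that would look roughly like this (note that max_replicas_per_node needs compose file format 3.8 or newer, so the version line has to be bumped too):

version: '3.8'

services:

  api:
    image: registry.gitlab.com/myproj/api:${VERSION}
    deploy:
      replicas: 1                  # assuming a single replica, as in the scale=1 case from the question
      placement:
        max_replicas_per_node: 1   # at most one task of this service per node
      update_config:
        order: start-first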