I've been deploying stacks to swarms with the start-first
option for quite a while now.
So given the following api.yml
file:
version: '3.4'
services:
  api:
    image: registry.gitlab.com/myproj/api:${VERSION}
    deploy:
      update_config:
        order: start-first
I would run the following command against a swarm manager:
env VERSION=x.y.z docker stack deploy -f api.yml api
This worked fine: the old service kept serving requests until the new one was fully available, and only then would it be torn down and enter the Shutdown state.
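For reference, this is roughly how I check that rollover, using the service name the stack above produces; the old tasks should show Shutdown as their desired state:
# task list for the service; old tasks should be marked for shutdown
docker service ps api_api
# or show only the tasks that are supposed to be going away
docker service ps --filter "desired-state=shutdown" api_api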
Recently, though - and I believe this started with docker v17.12.0-ce or v18.01.0-ce, unless I simply didn't notice it before - the old service sometimes isn't stopped correctly.
When that happens it hangs around and keeps serving requests, leaving us running a mix of old and new versions side by side indefinitely.
This happens both on swarms where the service is replicated and on one that runs it with scale=1.
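To illustrate what I mean by a mix, this is more or less how I spot the leftovers (the format string is just what I happen to use):
# containers for the service on this node, with the image each one was started from
docker ps --filter "name=api_api" --format "table {{.ID}}\t{{.Image}}\t{{.Status}}"
# the swarm's view of the tasks, untruncated, including which image each task runs
docker service ps --no-trunc api_api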
What's worse, I cannot even kill the old containers. Here's what I've tried:
docker service rm api_api
docker stack rm api && docker stack deploy -f api.yml api
docker rm -f <container id>
Nothing allows me to get rid of the 'zombie' container. In fact docker rm -f <container id>
even locks up and simply sits there.
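For what it's worth, this is the kind of information I plan to gather the next time a container gets stuck; the journalctl unit name assumes a systemd host, which may not match every setup:
# how the engine sees the stuck container
docker inspect --format '{{.State.Status}} {{.State.Pid}}' <container id>
# full task history for the service, including any error messages
docker service ps --no-trunc api_api
# daemon logs around the time of the deploy (systemd hosts only)
journalctl -u docker.service --since "1 hour ago"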
The only way I've found to get rid of them is to restart the node. Thanks to replication I can afford to do that without downtime, but it's not great for various reasons, not least of which is what might happen if another manager were to go down while I'm doing it.
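For completeness, this is roughly how I take the node out of rotation before restarting it (the node name is a placeholder):
# move the node's tasks elsewhere before the reboot
docker node update --availability drain <node name>
# ...reboot the node, then put it back into rotation
docker node update --availability active <node name>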
Has anyone else seen this behaviour? What might be the cause and how could I debug this?