I've set up a docker swarm mode cluster, with two managers and one worker. This is on Centos 7. They're on machines dkr1, dkr2, dkr3. dkr3 is the worker.
I was upgrading to v1.13 the other day, and wanted zero downtime. But it didn't work exactly as expected. I'm trying to work out the correct way to do it, since this is one of the main goals, of having a cluster.
The swarm is in 'global' mode. That is, one replica per machine. My method for upgrading was to drain the node, stop the daemon, yum upgrade, start daemon. (Note that this wiped out my daemon config settings for ExecStart=...! Be careful if you upgrade.)
Our client/ESB hits dkr2, which does its load balancing magic over the swarm. dkr2 which is the leader. dkr1 is 'reachable'
I brought down dkr3. No issues. Upgraded docker. Brought it back up. No downtime from bringing down the worker.
Brought down dkr1. No issue at first. Still working when I brought it down. Upgraded docker. Brought it back up. But during startup, it 404'ed. Once up, it was OK.
Brought down dkr2. I didn't actually record what happened then, sorry.
Anyway, while my app was starting up on dkr1, it 404'ed, since the server hadn't started yet.
Any idea what I might be doing wrong? I would suppose I need a health check of some sort, because the container is obviously ok, but the server isn't responding yet. So that's when I get downtime.