
During an update (in place, in this case) of our Swarm, we have to drain a node, update it, make it active again, drain the next node, and so on.
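For context, the per-node cycle is roughly the following (the node name is just a placeholder):

# drain the node so its tasks are rescheduled on the other nodes
docker node update --availability=drain worker-1

# ... update the node in place ...

# make the node schedulable again
docker node update --availability=active worker-1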

It works perfectly for the first node, as the load of the containers to reschedule is spread quite fairly across all the remaining nodes. Things get difficult when draining the second node, though: all the containers to reschedule go to the recently updated node, which has (almost) no tasks running.

The load of starting up all those services at once is huge compared to normal operation; the node cannot keep up, and some containers may fail to start because of healthcheck constraints and the max_attempts restart policy.
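For reference, the kind of per-service settings involved look like this (service name, image and values are illustrative, not our actual configuration):

docker service create \
  --name web \
  --health-cmd "curl -f http://localhost/ || exit 1" \
  --health-interval 10s \
  --health-retries 3 \
  --restart-condition on-failure \
  --restart-max-attempts 3 \
  nginx:latest

With a policy like this, a task that repeatedly fails its health check on the overloaded node can exhaust its restart attempts and stay down.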

Do you know of a way to reschedule that avoids that spike and its unwanted results (priority, wait time, update strategy, ...)?

Cheers, Thomas

tbrouhier

1 Answer


This will need to be a manual process. You can pause scheduling on the node that is going down, and then gradually stop the containers on that node so that they migrate slowly to the other nodes in the swarm cluster. E.g.

# on manager
docker node update --availability=pause node-to-stop

# on the paused node, stop the swarm task containers one at a time
docker container ls --filter label=com.docker.swarm.task -q \
 | while read -r cid; do
     echo "stopping $cid"
     docker stop "$cid"
     echo "pausing"
     # give the other nodes time to start the replacement task
     sleep 60
   done

Adjust the sleep command as appropriate for your environment.
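Once the node is empty, you could presumably finish the cycle by draining it fully, performing the maintenance, and then making it schedulable again (same placeholder node name as above):

# on manager
docker node update --availability=drain node-to-stop
# ... update the node in place ...
docker node update --availability=active node-to-stop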

BMitch