During an in-place update of our Swarm, we have to drain a node, update it, make it active again, then drain the next node, and so on.
It works perfectly for the first node, as the containers to reschedule are spread fairly evenly across all the remaining nodes. Things get difficult when draining the second node, though: all the containers to reschedule go to the recently updated node, which has (almost) no tasks running.
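To illustrate the effect, here is a minimal sketch in Python. It is a simplified model, not Swarm's actual scheduler: it assumes each rescheduled task simply goes to the active node that currently has the fewest tasks, and the node names and task counts are made up for the example.

```python
def drain(nodes, name):
    """Remove `name` and hand its tasks, one at a time, to the least-loaded remaining node."""
    remaining = {n: count for n, count in nodes.items() if n != name}
    received = {n: 0 for n in remaining}
    for _ in range(nodes[name]):
        # Simplified spread rule: fewest tasks wins, ties broken by node name.
        target = min(remaining, key=lambda n: (remaining[n], n))
        remaining[target] += 1
        received[target] += 1
    return remaining, received


# Hypothetical cluster: 4 nodes with 10 tasks each.
cluster = {"node1": 10, "node2": 10, "node3": 10, "node4": 10}

# First drain: node1's tasks are spread over the three other nodes.
cluster, first_received = drain(cluster, "node1")
print("first drain:", first_received)

# node1 comes back empty after its update; nothing is rebalanced onto it.
cluster["node1"] = 0

# Second drain: node2's tasks all land on the empty node1.
cluster, second_received = drain(cluster, "node2")
print("second drain:", second_received)
```

In this model the first drain hands 3 or 4 tasks to each remaining node, while the second drain sends all 14 tasks to the freshly updated node, which is the spike we are seeing.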
The load from starting all the services at once is huge compared to normal operation. The node cannot keep up, and some containers may fail to start because of healthcheck constraints and the max_attempts policy.
Do you know of a way to reschedule that avoids this spike and its unwanted results (priority, wait time, update strategy...)?
Cheers, Thomas