I run three Swarm nodes: one on-premises 2-core manager and two smaller 2-core workers on AWS.
Today I experienced a DNS issue, and the manager lost connection to the workers.
Once the manager was back up, it tried to run all the services on its own. Within a few seconds of the Docker service starting, the whole node ran out of memory and froze. The same happened to the workers.
I can't recover the workers at all, because they freeze within a few seconds.
This has happened a few times over the past few months, but this was the first time I couldn't fully recover.
I have already set CPU and memory limits, and for now I have also set node constraints, so most services won't spawn in case of a total failure.
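To make clear what kind of settings I mean, here is a minimal sketch of per-service limits and a placement constraint in a stack file. It assumes deployment with `docker stack deploy`. The service name `web`, the image, the replica count, and the numbers are placeholders, not my actual configuration.

```yaml
version: "3.8"

services:
  web:                         # hypothetical service name
    image: nginx:alpine        # placeholder image
    deploy:
      replicas: 2              # placeholder value
      resources:
        limits:
          cpus: "0.50"         # cap each task at half a core (placeholder)
          memory: 256M         # cap each task's memory (placeholder)
      placement:
        constraints:
          # Only schedule on worker nodes, so the tasks stay pending
          # instead of piling onto the manager when the workers are gone.
          - node.role == worker
```

With a constraint like this, the tasks should stay pending rather than start on the manager when no worker is reachable.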
How can this be prevented? Are there any better measures?
Sorry for the long post, but someone would most likely have asked for these details later anyway.