How to prevent Swarm nodes from freezing?

Question

I run three Swarm nodes: one on-premises 2-core manager and two smaller workers (2-core) on AWS.

Today I experienced a DNS issue, and the manager lost connection to the workers.

Once the manager was back up, it tried to run all the services by itself, and within a few seconds of the Docker service starting, the whole node ran out of memory and froze. The same happened to the workers.

I can't recover the workers, because they freeze within a few seconds.

This has happened a few times in recent months, but this is the first time I couldn't fully recover.

I have already set CPU and memory limits, and for now I have added node constraints, so most services won't spawn in case of a total failure.

How can this be prevented? Are there any better measures?

Sorry for the long post, but most probably someone would have asked those questions later.

If the node runs out of memory, then you have not set appropriate reservations and limits. And if an external DNS / connectivity issue causes containers to die, then you want to look at your application or healthcheck to avoid crashing based on an external dependency. — BMitch, Nov 07 '19 at 15:55
It would be better if you provided some logs to help others understand your issue — c4f4t0r, Nov 07 '19 at 16:55
My logs don't seem to provide anything really useful. In every situation the whole system freezes, not just the docker daemon. I will try to get the manager's logs, after I solve the whole issue and have some spare time. — Kostas C., Nov 08 '19 at 07:09
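
To illustrate the first comment's point about reservations versus limits, here is a toy model in Python, not Swarm's actual scheduler. It assumes the commonly documented behaviour that placement is decided by memory reservations, while limits only cap each container individually. The node size, headroom, service names, and per-service figures are all hypothetical, since the question does not state them.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    reserve_mb: int  # memory the scheduler accounts for when placing the task
    limit_mb: int    # hard cap enforced per container, ignored for placement


def place(services, node_capacity_mb):
    """Admit services in order while the sum of reservations fits on the node."""
    placed, pending, reserved = [], [], 0
    for svc in services:
        if reserved + svc.reserve_mb <= node_capacity_mb:
            placed.append(svc.name)
            reserved += svc.reserve_mb
        else:
            pending.append(svc.name)  # stays unscheduled instead of overloading
    return placed, pending


def worst_case_mb(services, placed):
    """Memory used if every placed service grows up to its limit."""
    return sum(s.limit_mb for s in services if s.name in placed)


NODE_MB = 4096      # hypothetical RAM of the surviving node
HEADROOM_MB = 512   # hypothetical share kept free for the OS and the daemon

# Six hypothetical services, each capped at 1 GiB.
limits_only = [Service(f"svc{i}", 0, 1024) for i in range(6)]
with_reservations = [Service(f"svc{i}", 1024, 1024) for i in range(6)]

if __name__ == "__main__":
    for label, services in (("limits only", limits_only),
                            ("with reservations", with_reservations)):
        placed, pending = place(services, NODE_MB - HEADROOM_MB)
        print(label, "placed:", placed, "pending:", pending,
              "worst case MB:", worst_case_mb(services, placed))
```

In this model, limits alone let all six services land on the one remaining node, with a worst case of 6144 MB against 4096 MB of RAM. With reservations equal to the limits, only three are admitted and the rest stay pending until another node is available.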

