
I run three Swarm nodes: one on-premises 2-core manager and two smaller 2-core workers on AWS.

Today I experienced a DNS issue, and the manager lost connection to the workers.

After the manager was back up, it tried to run all the services on its own; within a few seconds of the Docker service starting, the whole node ran out of memory and froze. The same happened to the workers.

Right now I can't recover the workers at all, because they freeze again within a few seconds of coming up.

This has happened a few times over the last few months, but this is the first time I couldn't fully recover.

I have already set CPU and memory limits, and for now I have added node placement constraints so that most services won't be rescheduled onto a single node in the event of a total failure (see the sketch below).
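For reference, this is roughly what those limits and constraints look like in my stack file. The service name, image, and values are just placeholders, not my real configuration:

```yaml
version: "3.8"
services:
  web:                           # placeholder service
    image: nginx:alpine
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.role == worker  # keep this service off the manager
      resources:
        limits:                  # hard cap the container can never exceed
          cpus: "0.50"
          memory: 256M
        reservations:            # scheduler only places the task where this much is free
          cpus: "0.25"
          memory: 128M
    healthcheck:                 # check the container itself, not an external dependency
      test: ["CMD", "wget", "-qO-", "http://localhost/"]
      interval: 30s
      timeout: 5s
      retries: 3
```

As far as I understand, the reservations are what the scheduler checks before placing a task on a node, so a 2-core node shouldn't end up with more tasks than it can actually hold.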

How can this be prevented? Are there any better measures?

Sorry for the long post, but someone would most likely have asked for these details anyway.

  • If the node runs out of memory, then you have not set appropriate reservations and limits. And if an external DNS / connectivity issue causes containers to die, then you want to look at your application or healthcheck to avoid crashing based on an external dependency. – BMitch Nov 07 '19 at 15:55
  • It could be better if you provided some logs to help others understand your issue – c4f4t0r Nov 07 '19 at 16:55
  • My logs don't seem to provide anything really useful. In every situation the whole system freezes, not just the docker daemon. I will try to get the manager's logs, after I solve the whole issue and have some spare time. – Kostas C. Nov 08 '19 at 07:09

0 Answers