
I have a cluster set up in Google Kubernetes Engine (GKE), with preemptible instances, TPU support, and 1 container per node.

When the container process errors out (e.g. a Python exception), GKE evicts the pod because of excessive ephemeral local storage use. I tried setting the ephemeral storage request/limit to 4Gi in case it had something to do with buffered logs, but I still get evictions like this:

Pod ephemeral local storage usage exceeds the total limit of containers 4Gi.
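For reference, the resources stanza I added looks roughly like this (the pod, container, and image names here are just placeholders; everything else is omitted):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod                          # placeholder name
spec:
  containers:
  - name: trainer                             # placeholder name
    image: gcr.io/my-project/trainer:latest   # placeholder image
    resources:
      requests:
        ephemeral-storage: 4Gi
      limits:
        ephemeral-storage: 4Gi
```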

Looking at the metrics, the increase is effectively instantaneous, at least within the 1-minute sampling interval:

[screenshot: ephemeral storage usage metric spiking to the limit within one sample]

This makes me think that GKE is trying to write a core dump, which would be a really bad idea given that the container process uses more memory than the node has disk space.

Looking over docs, bugs, etc., I don't see how to configure or disable this behavior. It looks like it falls under GKE's control as the invoker of "docker run", rather than something configurable from inside the container.

Does this theory sound right, or is there something else that could consume this much ephemeral local storage this quickly? And if it is a core dump, how can I disable that behavior on GKE?
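The only node-level workaround I can think of is something like the following privileged DaemonSet that points kernel.core_pattern at /dev/null. This is only a sketch I haven't tried (all names are placeholders), and I don't know whether it is supported or advisable on GKE:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-core-dumps          # placeholder name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: disable-core-dumps
  template:
    metadata:
      labels:
        app: disable-core-dumps
    spec:
      containers:
      - name: set-core-pattern
        image: busybox
        securityContext:
          privileged: true          # needed to write the non-namespaced sysctl on the host
        command:
        - sh
        - -c
        - |
          # Send core dumps to /dev/null node-wide, then idle so the pod stays Running.
          echo '/dev/null' > /proc/sys/kernel/core_pattern
          while true; do sleep 3600; done
```

If there is a more idiomatic GKE-level setting for this, I would rather use that.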

  • Not sure if the real cause is the core dumps, but you can check [this](https://stackoverflow.com/questions/58704192/how-to-disable-core-file-dumps-in-docker-container) and see if it works. Also try increasing the disk size of your node pool; you will need to [migrate](https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool) to a new node pool to do that. – Alex G Dec 01 '20 at 06:38
  • Unfortunately disabling dumps has to happen on the node, not inside a container image - is there a workflow for that in kubernetes/GKE? Also I'm not sure what you mean by size - disk space? – c b Dec 02 '20 at 08:40
  • You can check [Cloud Audit Logs and Cloud Logging](https://cloud.google.com/kubernetes-engine/docs/how-to/audit-logging) to identify the root cause of the evictions. Then post anything helpful that you find and we will see what needs to be disabled. – Alex G Dec 17 '20 at 10:21
