I have a cluster set up in Google Kubernetes Engine (GKE), with preemptible instances, TPU support, and 1 container per node.
When the container process errors out (e.g. Python exception) GKE evicts the pod because of excessive ephemeral local storage use. I tried setting the ephemeral storage request/limit to 4Gi in case it was something to do with buffered logs, but I still get evictions like this:
Pod ephemeral local storage usage exceeds the total limit of containers 4Gi.
Looking at metrics, the increase is instant up to the 1 min sampling interval:
This makes me think that GKE is trying to core dump, which would be a really bad idea with the container process using more memory than total disk space.
Looking over docs, bugs, etc. I don't see how to configure or disable this - it looks like GKE's job as the invoker of "docker run", and not something configurable inside the container.
Does this theory sound right, or what else could dump this much ephemeral local storage? And if it is a core dump, how can I disable this behavior on GKE?