
On a Google Container Engine (GKE) cluster, I sometimes see one or more pods not starting, and looking at their events I can see the following:

Pod sandbox changed, it will be killed and re-created.

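For reference, I see the event in the output of commands like these (pod and namespace names are placeholders):

    # Inspect the affected pod's events
    kubectl describe pod <pod-name> -n <namespace>

    # Or list recent events in the namespace, sorted by creation time
    kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp
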
If I wait, it just keeps retrying.
If I delete the pod and allow it to be recreated by the Deployment's ReplicaSet, it starts properly.

The behavior is inconsistent.

This happens on Kubernetes versions 1.7.6 and 1.7.8.

Any ideas?

Eldad Assis

3 Answers


In my case, it happened because the memory and CPU limits were too low.

For example, in your manifest file, increase the limits and requests from:

  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi

to this:

  limits:
    cpu: 1000m
    memory: 2048Mi
  requests:
    cpu: 500m
    memory: 1024Mi
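
If it is not obvious that the limits are the problem, a quick sanity check looks roughly like this (pod and node names are placeholders; `kubectl top` needs a metrics source in the cluster):

    # Look for OOMKilled statuses or sandbox-related events on the pod
    kubectl describe pod <pod-name>

    # Compare actual usage against the requests/limits above (needs metrics)
    kubectl top pod <pod-name>

    # Check how much CPU/memory is allocatable on the node
    kubectl describe node <node-name>
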
Gilad Sharaby
  • That was it in my case. Increasing memory and CPU fixed it for me. How could the error be so obfuscated? I lost hours because of this. – tozka Aug 12 '20 at 20:35
  • This is ridiculous. For me it was because I put `memory: 300m` instead of `memory: 300Mi` – Phil Oct 05 '20 at 11:06
  • The event log in my pods shows exactly the same error, on a VM running on a legacy server. Should I increase the VM's CPU and memory overall, or change something in a config file? Please let me know, thanks! – surya kiran Apr 11 '22 at 19:39

I can see the following message posted on the Google Cloud Status Dashboard:

"We are investigating an issue affecting Google Container Engine (GKE) clusters where after docker crashes or is restarted on a node, pods are unable to be scheduled.

The issue is believed to be affecting all GKE clusters running Kubernetes v1.6.11, v1.7.8 and v1.8.1.

Our Engineering Team suggests: If nodes are on release v1.6.11, please downgrade your nodes to v1.6.10. If nodes are on release v1.7.8, please downgrade your nodes to v1.7.6. If nodes are on v1.8.1, please downgrade your nodes to v1.7.6.

Alternative workarounds are also provided by the Engineering team in this doc. These workarounds are applicable to the customers that are unable to downgrade their nodes."
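
For reference, a node-pool version change like the one suggested above can be done with gcloud, roughly like this (cluster, pool, and zone names are placeholders):

    # List node versions the server currently accepts
    gcloud container get-server-config --zone <zone>

    # Change the node pool's Kubernetes version (here to 1.7.6)
    gcloud container clusters upgrade <cluster-name> \
        --node-pool <pool-name> \
        --cluster-version 1.7.6 \
        --zone <zone>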

Carlos
  • Interesting. Nice catch, although I had this also at 1.7.6. I will try one of the workarounds and update! – Eldad Assis Oct 30 '17 at 19:21
  • Current status - I tried one of Google's workarounds. It did not help. I downgraded the cluster nodes to 1.7.6 (which I already had issues in). Seems to be better, but still unsure. – Eldad Assis Oct 31 '17 at 14:38
  • No luck. Still getting these errors. Google are rolling out a fix, so I hope this helps. – Eldad Assis Nov 03 '17 at 18:42
  • Eldad AK, if you downgraded to 1.7.6 and are still seeing the issue, it may not be related to the incident. You should check the events and/or kubelet log to see if there are any errors starting/running the PodSandbox. – Yu-Ju Hong Nov 07 '17 at 18:22
  • Latest update - upgrading the cluster to 1.8.1-gke.1 seemed to have solved these issues (for now). It's been running several days without a single error related to my original post. – Eldad Assis Nov 12 '17 at 07:46
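
Following the suggestion in the comment above, the kubelet log on an affected GKE node can be inspected roughly like this (node name and zone are placeholders; this assumes kubelet runs as a systemd unit on the node, as it does on Container-Optimized OS):

    # SSH to the node and grep the kubelet log for sandbox-related errors
    gcloud compute ssh <node-name> --zone <zone> -- \
        sudo journalctl -u kubelet --no-pager | grep -i sandbox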

I was affected by the same issue on one node in a GKE 1.8.1 cluster (the other nodes were fine). I did the following (a rough command sketch appears after the list):

  1. Make sure your node pool has enough headroom to receive all the pods scheduled on the affected node. When in doubt, increase the node pool size by 1.
  2. Drain the affected node following this manual:

    kubectl drain <node>
    

    You may run into warnings about DaemonSets or pods with local storage; proceed with the operation.

  3. Power down the affected node in Compute Engine. GKE should schedule a replacement node if your pool size is smaller than specified in the pool description.
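
A rough sketch of the steps above as commands; cluster, pool, node, and zone names are placeholders, and exact flag names may differ across gcloud/kubectl versions:

    # 1. Add headroom: grow the node pool by one node
    gcloud container clusters resize <cluster-name> --node-pool <pool-name> \
        --num-nodes <current-size-plus-one> --zone <zone>

    # 2. Drain the affected node; these flags answer the DaemonSet/local-storage warnings
    kubectl drain <node-name> --ignore-daemonsets --delete-local-data

    # 3. Power down the node in Compute Engine
    gcloud compute instances stop <node-name> --zone <zone>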

  • This is a good solution for a bad node, but my issues seem to happen on more than a single node. And they are not always at the same time, so it feels like a ghost hunt. – Eldad Assis Nov 02 '17 at 06:33
  • Sure, big clusters with multiple problem nodes will require too much manual work with this solution. I hope this answer helps someone with a small cluster who happens to find this thread. –  Nov 03 '17 at 12:24