
On a Google Container Engine (GKE) cluster, I sometimes see one or more pods not starting, and looking at their events I can see the following:

Pod sandbox changed, it will be killed and re-created.

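For reference, I see the event in the output of commands like these (pod and namespace names are placeholders):

    # Inspect the affected pod's events
    kubectl describe pod <pod-name> -n <namespace>

    # Or list recent events in the namespace, sorted by creation time
    kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp
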
If I wait, it just keeps retrying.
If I delete the pod and allow it to be recreated by the Deployment's ReplicaSet, it starts properly.

The behavior is inconsistent.

This happens on Kubernetes versions 1.7.6 and 1.7.8.

Any ideas?

Eldad Assis

3 Answers


In my case, it happened because the memory and CPU limits were too low.

For example, in your manifest file, increase the limits and requests from:

  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi

to this:

  limits:
    cpu: 1000m
    memory: 2048Mi
  requests:
    cpu: 500m
    memory: 1024Mi
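
If it is not obvious that the limits are the problem, a quick sanity check looks roughly like this (pod and node names are placeholders; `kubectl top` needs a metrics source in the cluster):

    # Look for OOMKilled statuses or sandbox-related events on the pod
    kubectl describe pod <pod-name>

    # Compare actual usage against the requests/limits above (needs metrics)
    kubectl top pod <pod-name>

    # Check how much CPU/memory is allocatable on the node
    kubectl describe node <node-name>
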
Gilad Sharaby
  • That was it in my case. Increasing memory and CPU fixed it for me. How could the error be so obfuscated? I lost hours because of this. – tozka Aug 12 '20 at 20:35
  • This is ridiculous. For me it was because I put `memory: 300m` instead of `memory: 300Mi` – Phil Oct 05 '20 at 11:06
  • The event log in my pods shows exactly the same error, on a VM running on a legacy server. Should I increase the VM's CPU and memory overall, or change something in a config file? Please let me know, thanks! – surya kiran Apr 11 '22 at 19:39

I can see the following message posted on the Google Cloud Status Dashboard:

"We are investigating an issue affecting Google Container Engine (GKE) clusters where after docker crashes or is restarted on a node, pods are unable to be scheduled.

The issue is believed to be affecting all GKE clusters running Kubernetes v1.6.11, v1.7.8 and v1.8.1.

Our Engineering Team suggests: If nodes are on release v1.6.11, please downgrade your nodes to v1.6.10. If nodes are on release v1.7.8, please downgrade your nodes to v1.7.6. If nodes are on v1.8.1, please downgrade your nodes to v1.7.6.

Alternative workarounds are also provided by the Engineering team in this doc. These workarounds are applicable to the customers that are unable to downgrade their nodes."
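
For reference, a node-pool version change like the one suggested above can be done with gcloud, roughly like this (cluster, pool, and zone names are placeholders):

    # List node versions the server currently accepts
    gcloud container get-server-config --zone <zone>

    # Change the node pool's Kubernetes version (here to 1.7.6)
    gcloud container clusters upgrade <cluster-name> \
        --node-pool <pool-name> \
        --cluster-version 1.7.6 \
        --zone <zone>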

Carlos
  • Interesting. Nice catch, although I had this also at 1.7.6. I will try one of the workarounds and update! – Eldad Assis Oct 30 '17 at 19:21
  • Current status - I tried one of Google's workarounds. It did not help. I downgraded the cluster nodes to 1.7.6 (which I already had issues in). Seems to be better, but still unsure. – Eldad Assis Oct 31 '17 at 14:38
  • No luck. Still getting these errors. Google are rolling out a fix, so I hope this helps. – Eldad Assis Nov 03 '17 at 18:42
  • Eldad AK, if you downgraded to 1.7.6 and are still seeing the issue, it may not be related to the incident. You should check the events and/or kubelet log to see if there are any errors starting/running the PodSandbox. – Yu-Ju Hong Nov 07 '17 at 18:22
  • Latest update - upgrading the cluster to 1.8.1-gke.1 seemed to have solved these issues (for now). It's been running several days without a single error related to my original post. – Eldad Assis Nov 12 '17 at 07:46
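
Following the suggestion in the comment above, the kubelet log on an affected GKE node can be inspected roughly like this (node name and zone are placeholders; this assumes kubelet runs as a systemd unit on the node, as it does on Container-Optimized OS):

    # SSH to the node and grep the kubelet log for sandbox-related errors
    gcloud compute ssh <node-name> --zone <zone> -- \
        sudo journalctl -u kubelet --no-pager | grep -i sandbox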

I was affected by the same issue on one node in a GKE 1.8.1 cluster (the other nodes were fine). I did the following (a rough command sketch appears after the list):

  1. Make sure your node pool has enough headroom to receive all the pods scheduled on the affected node. When in doubt, increase the node pool size by 1.
  2. Drain the affected node following this manual:

    kubectl drain <node>
    

    You may run into warnings about DaemonSets or pods with local storage; proceed with the operation.

  3. Power down the affected node in Compute Engine. GKE should schedule a replacement node if your pool size is smaller than specified in the pool description.
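
A rough sketch of the steps above as commands; cluster, pool, node, and zone names are placeholders, and exact flag names may differ across gcloud/kubectl versions:

    # 1. Add headroom: grow the node pool by one node
    gcloud container clusters resize <cluster-name> --node-pool <pool-name> \
        --num-nodes <current-size-plus-one> --zone <zone>

    # 2. Drain the affected node; these flags answer the DaemonSet/local-storage warnings
    kubectl drain <node-name> --ignore-daemonsets --delete-local-data

    # 3. Power down the node in Compute Engine
    gcloud compute instances stop <node-name> --zone <zone>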

  • This is a good solution for a bad node, but my issues seem to happen on more than a single node. And they are not always at the same time, so it feels like a ghost hunt. – Eldad Assis Nov 02 '17 at 06:33
  • Sure, big clusters with multiple problem nodes will require too much manual work with this solution. I hope this answer helps someone with a small cluster who happens to find this thread. –  Nov 03 '17 at 12:24