I have an existing cluster running k8s version 1.12.8 on AWS EC2. The cluster contains several pods, some serving web traffic and others configured as scheduled CronJobs. The cluster has been running fine in its current configuration for at least 6 months, with the CronJobs running every 5 minutes.
Recently, the CronJobs simply stopped. Viewing the pods via kubectl shows that all of the scheduled CronJobs last ran at roughly the same time. Logs sent to AWS CloudWatch show no error output, and stop at the same time kubectl reports for the last run.
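For reference, this is the kind of inspection I mean (generic commands, nothing cluster-specific; the LAST SCHEDULE column shows when each CronJob was last triggered):

```shell
# LAST SCHEDULE shows when each CronJob was last triggered by the controller.
kubectl get cronjobs --all-namespaces

# Cross-check against the Jobs actually created, newest last -
# the most recent creation timestamps all cluster at the failure time.
kubectl get jobs --all-namespaces --sort-by=.metadata.creationTimestamp
```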
In trying to diagnose this issue I have found a broader pattern of the cluster being unresponsive to changes: I cannot retrieve logs or nodes via kubectl, Pods I delete from ReplicaSets never come back, and setting autoscale values on ReplicaSets has no effect.
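Concretely, these are the sorts of probes that go unanswered (again generic commands; all of them go through the apiserver and should return within a second or two on a healthy cluster):

```shell
# Basic apiserver round-trips that currently hang or time out for me.
kubectl get nodes
kubectl get componentstatuses    # reported health of scheduler/controller-manager/etcd
kubectl -n kube-system get pods  # control-plane static pods
```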
Investigation of the kubelet logs on the master instance revealed repeating errors coinciding with the time the failure was first noticed:
I0805 03:17:54.597295 2730 kubelet.go:1928] SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z-west-y.compute.internal_kube-system(181xxyyzz)", event: &pleg.PodLifecycleEvent{ID:"181xxyyzz", Type:"ContainerDied", Data:"405ayyzzz"}
...
E0805 03:18:10.867737 2730 kubelet_node_status.go:378] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"NetworkUnavailable\"},{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"Ready\"}]}}" for node "ip-172-20-60-88.eu-west-2.compute.internal": Patch https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z-west-y.compute.internal/status?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
...
E0805 03:18:20.869436 2730 kubelet_node_status.go:378] Error updating node status, will retry: error getting node "ip-172-20-60-88.eu-west-2.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-172-20-60-88.eu-west-2.compute.internal?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
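To pin down when the failures started, I filtered the kubelet log down to error-severity lines (klog prefixes them with E followed by the MMDD date). The log path below is a placeholder; on a systemd host the equivalent would be journalctl -u kubelet:

```shell
# Hypothetical log location; adjust for your distro/installer,
# or use: journalctl -u kubelet
LOG=/var/log/kubelet.log

# klog error lines start with "E" plus the MMDD date, e.g. "E0805".
grep -E '^E[0-9]{4}' "$LOG" 2>/dev/null | head
```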
Running docker ps on the master node shows that both the k8s_kube-controller-manager_kube-controller-manager and k8s_kube-scheduler_kube-scheduler containers were started 6 days ago, whereas the other k8s containers have been up for 8+ months.
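The uptime comparison came from filtering the docker ps output on the master (a generic sketch; the names follow the k8s_<container>_<pod>... convention kubelet gives its containers):

```shell
# Show name and uptime for the control-plane containers only.
docker ps --format '{{.Names}}\t{{.Status}}' \
  | grep -E 'k8s_kube-(scheduler|controller-manager|apiserver)|k8s_etcd'
```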
tl;dr

A container on my master node (likely kube-scheduler, kube-controller-manager, or both) died. The containers have come back up but are unable to communicate with the existing nodes, which is preventing any scheduled CronJobs or new deployments from being satisfied.

How can I re-configure kubelet and its associated services on the master node to communicate with the worker nodes again?