
I was originally trying to run a Job that seemed to get stuck in a CrashLoopBackOff. Here is the Job manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: es-setup-indexes
  namespace: elk-test
spec:
  template:
    metadata:
      name: es-setup-indexes
    spec:
      containers:
      - name: es-setup-indexes
        image: appropriate/curl
        command: ['curl -H  "Content-Type: application/json" -XPUT http://elasticsearch.elk-test.svc.cluster.local:9200/_template/filebeat -d@/etc/filebeat/filebeat.template.json']
        volumeMounts:
        - name: configmap-volume
          mountPath: /etc/filebeat/filebeat.template.json
          subPath: filebeat.template.json
      restartPolicy: Never

      volumes:
        - name: configmap-volume
          configMap:
            name: elasticsearch-configmap-indexes
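
In hindsight, I suspect the crash itself came from passing the whole curl invocation as a single string in command, so the runtime tries to exec a binary literally named "curl -H ...". Splitting it into an argument list, something like the sketch below (same image, URL, and mount path as above), should let curl actually run:

        # exec form: each argument is its own list item
        command:
          - curl
          - -H
          - "Content-Type: application/json"
          - -XPUT
          - http://elasticsearch.elk-test.svc.cluster.local:9200/_template/filebeat
          - -d@/etc/filebeat/filebeat.template.json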

I tried deleting the job but it would only work if I ran the following command:

kubectl delete job es-setup-indexes --cascade=false
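
Looking back, as far as I understand it, --cascade=false deletes the Job object itself but orphans its pods instead of garbage-collecting them, which would explain why they were all left behind. Normally a plain cascading delete removes the pods along with the Job, though with this many failed pods it can take a very long time:

kubectl delete job es-setup-indexes -n elk-test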

After that I noticed when running:

kubectl get pods -w

I would get a TON of pods in an Error state and saw no way to clean them up. Here is just a small sample of the output when I run kubectl get pods:

es-setup-indexes-zvx9c   0/1       Error     0         20h
es-setup-indexes-zw23w   0/1       Error     0         15h
es-setup-indexes-zw57h   0/1       Error     0         21h
es-setup-indexes-zw6l9   0/1       Error     0         16h
es-setup-indexes-zw7fc   0/1       Error     0         22h
es-setup-indexes-zw9bw   0/1       Error     0         12h
es-setup-indexes-zw9ck   0/1       Error     0         1d
es-setup-indexes-zwf54   0/1       Error     0         18h
es-setup-indexes-zwlmg   0/1       Error     0         16h
es-setup-indexes-zwmsm   0/1       Error     0         21h
es-setup-indexes-zwp37   0/1       Error     0         22h
es-setup-indexes-zwzln   0/1       Error     0         22h
es-setup-indexes-zx4g3   0/1       Error     0         11h
es-setup-indexes-zx4hd   0/1       Error     0         21h
es-setup-indexes-zx512   0/1       Error     0         1d
es-setup-indexes-zx638   0/1       Error     0         17h
es-setup-indexes-zx64c   0/1       Error     0         21h
es-setup-indexes-zxczt   0/1       Error     0         15h
es-setup-indexes-zxdzf   0/1       Error     0         14h
es-setup-indexes-zxf56   0/1       Error     0         1d
es-setup-indexes-zxf9r   0/1       Error     0         16h
es-setup-indexes-zxg0m   0/1       Error     0         14h
es-setup-indexes-zxg71   0/1       Error     0         1d
es-setup-indexes-zxgwz   0/1       Error     0         19h
es-setup-indexes-zxkpm   0/1       Error     0         23h
es-setup-indexes-zxkvb   0/1       Error     0         15h
es-setup-indexes-zxpgg   0/1       Error     0         20h
es-setup-indexes-zxqh3   0/1       Error     0         1d
es-setup-indexes-zxr7f   0/1       Error     0         22h
es-setup-indexes-zxxbs   0/1       Error     0         13h
es-setup-indexes-zz7xr   0/1       Error     0         12h
es-setup-indexes-zzbjq   0/1       Error     0         13h
es-setup-indexes-zzc0z   0/1       Error     0         16h
es-setup-indexes-zzdb6   0/1       Error     0         1d
es-setup-indexes-zzjh2   0/1       Error     0         21h
es-setup-indexes-zzm77   0/1       Error     0         1d
es-setup-indexes-zzqt5   0/1       Error     0         12h
es-setup-indexes-zzr79   0/1       Error     0         16h
es-setup-indexes-zzsfx   0/1       Error     0         1d
es-setup-indexes-zzx1r   0/1       Error     0         21h
es-setup-indexes-zzx6j   0/1       Error     0         1d
kibana-kq51v   1/1       Running   0         10h

But if I look at the jobs, nothing related to it shows up anymore:

$ kubectl get jobs --all-namespaces                                                                              
NAMESPACE     NAME               DESIRED   SUCCESSFUL   AGE
kube-system   configure-calico   1         1            46d

I've also noticed that kubectl seems much slower to respond. I don't know whether the pods are continuously being recreated or are just stuck in some broken state, but it would be great if someone could tell me how to troubleshoot this, as I have not come across an issue like this in Kubernetes before.

Kube info:

$ kubectl version 
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:33:27Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
xamox
  • What about the output of: kubectl describe pods – turkenh Jun 06 '17 at 00:59
  • How have you tried to delete the pods? What do you mean by 'it would only work with `--cascade=false`'? Was there an error? – johnharris85 Jun 06 '17 at 01:30
  • @turkenh So I did end up running that command. I saw which nodes they ran on, ssh'd into those, and manually deleted all the old containers that matched that image (found with docker ps -a). Even after deleting the old containers, kubectl still reports them. I don't know if I should try spinning up more nodes and migrating to a new node so I can tear down the old one, or if there is a way to get kube to sync back up with the state of docker. – xamox Jun 06 '17 at 03:25
  • @johnharris85 Ahh, thanks, that worked: deleting them all manually. It took about 2 hours as there were 9292 errored-out pods. – xamox Jun 06 '17 at 12:25
  • https://github.com/kubernetes/kubernetes/issues/53331 – kivagant Mar 17 '18 at 03:10

5 Answers


kubectl delete pods --field-selector status.phase=Failed -n <your-namespace>

This cleans up any failed pods in <your-namespace>.
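
If the failed pods are spread across several namespaces, newer kubectl releases should also accept the same field selector together with --all-namespaces (worth verifying on your version):

kubectl delete pods --field-selector=status.phase=Failed --all-namespaces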

Kevin Pedersen

Here is a quick way to fix it :)

kubectl get pods | grep Error | cut -d' ' -f 1 | xargs kubectl delete pod

Edit: add the -a flag to kubectl get pods if you are using an older version of k8s.
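
A slightly more defensive variant (assuming GNU xargs for the -r flag) skips the header row, matches only the STATUS column, and scopes the delete to one namespace:

kubectl get pods -n <your-namespace> --no-headers | awk '$3 == "Error" {print $1}' | xargs -r kubectl delete pod -n <your-namespace>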

marcostvz

I had many pods that were stuck in the following statuses:

  • ContainerCannotRun
  • Error
  • ImagePullBackOff

These pods were in the above states for valid reasons, but they didn't get cleaned up automatically even after the underlying issues were resolved.

To clean them up manually, the following attempts didn't work:

# Doesn't work
kubectl get pods --field-selector status.phase=Error 

# Doesn't work
kubectl get pods \
    --field-selector=status.phase=Error

# Doesn't work
kubectl get pods \
    --field-selector="status.phase=Error"

# Doesn't work
kubectl get pods \
    --field-selector="status.phase==Error"


None of those work because Error is not a valid pod phase; the only pod phases are Pending, Running, Succeeded, Failed, and Unknown, and the Error shown by kubectl get pods comes from the container state. Filtering instead by the phases we want to retain works perfectly:

# Validate the list of pods first.
# Add more != clauses for any other phases you want to keep.
kubectl get pods \
    --field-selector="status.phase!=Succeeded,status.phase!=Running"

# Delete the pods that match the filter
kubectl delete pods \
    --field-selector="status.phase!=Succeeded,status.phase!=Running"

Sairam Krish

I usually remove all the Error pods with this command:

kubectl delete pod `kubectl get pods --namespace <yournamespace> | awk '$3 == "Error" {print $1}'` --namespace <yournamespace>

Ahmed Hosny

The solution was what @johnharris85 suggested in the comments: I had to manually delete all the pods. To do that I ran the following:

kubectl get pods -w | tee all-pods.txt

That dumped all my pods to a file, which I then filtered so I could delete only the ones I wanted:

kubectl delete pod $(more all-pods.txt | grep es-setup-index | awk '{print $1}')

Note: I had about 9292 pods; it took about 1-2 hours to delete them all.
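
If I had to do it again, batching the deletes and not waiting for each pod to finish terminating (the --wait flag is available in newer kubectl releases, I believe) would probably speed this up considerably:

more all-pods.txt | grep es-setup-index | awk '{print $1}' | xargs -n 50 kubectl delete pod --wait=false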

xamox