
I was originally trying to run a Job that seemed to get stuck in a CrashLoopBackOff. Here is the Job manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: es-setup-indexes
  namespace: elk-test
spec:
  template:
    metadata:
      name: es-setup-indexes
    spec:
      containers:
      - name: es-setup-indexes
        image: appropriate/curl
        command: ['curl -H  "Content-Type: application/json" -XPUT http://elasticsearch.elk-test.svc.cluster.local:9200/_template/filebeat -d@/etc/filebeat/filebeat.template.json']
        volumeMounts:
        - name: configmap-volume
          mountPath: /etc/filebeat/filebeat.template.json
          subPath: filebeat.template.json
      restartPolicy: Never

      volumes:
        - name: configmap-volume
          configMap:
            name: elasticsearch-configmap-indexes
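
In hindsight, I suspect the crash itself came from passing the whole curl invocation as a single string in command, so the runtime tries to exec a binary literally named "curl -H ...". Splitting it into an argument list, something like the sketch below (same image, URL, and mount path as above), should let curl actually run:

        # exec form: each argument is its own list item
        command:
          - curl
          - -H
          - "Content-Type: application/json"
          - -XPUT
          - http://elasticsearch.elk-test.svc.cluster.local:9200/_template/filebeat
          - -d@/etc/filebeat/filebeat.template.json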

I tried deleting the job but it would only work if I ran the following command:

kubectl delete job es-setup-indexes --cascade=false
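
Looking back, as far as I understand it, --cascade=false deletes the Job object itself but orphans its pods instead of garbage-collecting them, which would explain why they were all left behind. Normally a plain cascading delete removes the pods along with the Job, though with this many failed pods it can take a very long time:

kubectl delete job es-setup-indexes -n elk-test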

After that I noticed when running:

kubectl get pods -w

I would get a TON of pods in an Error state and saw no way to clean them up. Here is just a small sample of the output when I run kubectl get pods:

es-setup-indexes-zvx9c   0/1       Error     0         20h
es-setup-indexes-zw23w   0/1       Error     0         15h
es-setup-indexes-zw57h   0/1       Error     0         21h
es-setup-indexes-zw6l9   0/1       Error     0         16h
es-setup-indexes-zw7fc   0/1       Error     0         22h
es-setup-indexes-zw9bw   0/1       Error     0         12h
es-setup-indexes-zw9ck   0/1       Error     0         1d
es-setup-indexes-zwf54   0/1       Error     0         18h
es-setup-indexes-zwlmg   0/1       Error     0         16h
es-setup-indexes-zwmsm   0/1       Error     0         21h
es-setup-indexes-zwp37   0/1       Error     0         22h
es-setup-indexes-zwzln   0/1       Error     0         22h
es-setup-indexes-zx4g3   0/1       Error     0         11h
es-setup-indexes-zx4hd   0/1       Error     0         21h
es-setup-indexes-zx512   0/1       Error     0         1d
es-setup-indexes-zx638   0/1       Error     0         17h
es-setup-indexes-zx64c   0/1       Error     0         21h
es-setup-indexes-zxczt   0/1       Error     0         15h
es-setup-indexes-zxdzf   0/1       Error     0         14h
es-setup-indexes-zxf56   0/1       Error     0         1d
es-setup-indexes-zxf9r   0/1       Error     0         16h
es-setup-indexes-zxg0m   0/1       Error     0         14h
es-setup-indexes-zxg71   0/1       Error     0         1d
es-setup-indexes-zxgwz   0/1       Error     0         19h
es-setup-indexes-zxkpm   0/1       Error     0         23h
es-setup-indexes-zxkvb   0/1       Error     0         15h
es-setup-indexes-zxpgg   0/1       Error     0         20h
es-setup-indexes-zxqh3   0/1       Error     0         1d
es-setup-indexes-zxr7f   0/1       Error     0         22h
es-setup-indexes-zxxbs   0/1       Error     0         13h
es-setup-indexes-zz7xr   0/1       Error     0         12h
es-setup-indexes-zzbjq   0/1       Error     0         13h
es-setup-indexes-zzc0z   0/1       Error     0         16h
es-setup-indexes-zzdb6   0/1       Error     0         1d
es-setup-indexes-zzjh2   0/1       Error     0         21h
es-setup-indexes-zzm77   0/1       Error     0         1d
es-setup-indexes-zzqt5   0/1       Error     0         12h
es-setup-indexes-zzr79   0/1       Error     0         16h
es-setup-indexes-zzsfx   0/1       Error     0         1d
es-setup-indexes-zzx1r   0/1       Error     0         21h
es-setup-indexes-zzx6j   0/1       Error     0         1d
kibana-kq51v   1/1       Running   0         10h

But if I look at the jobs, nothing related to it shows up anymore:

$ kubectl get jobs --all-namespaces                                                                              
NAMESPACE     NAME               DESIRED   SUCCESSFUL   AGE
kube-system   configure-calico   1         1            46d

I've also noticed that kubectl seems much slower to respond. I don't know whether the pods are continuously being recreated or are just stuck in some broken state, but it would be great if someone could tell me how to troubleshoot this, as I have not come across an issue like this in Kubernetes before.

Kube info:

$ kubectl version 
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:33:27Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
xamox
  • What about the output of: kubectl describe pods – turkenh Jun 06 '17 at 00:59
  • How have you tried to delete the pods? What do you mean by 'it would only work with `--cascade=false`'? Was there an error? – johnharris85 Jun 06 '17 at 01:30
  • @turkenh So I did end up running that command. I saw which nodes they ran on, ssh'd into those, and manually deleted all the old containers that matched that image (found with docker ps -a). Even after deleting the old containers, kubectl still reports them. I don't know if I should try spinning up more nodes and migrating to a new node so I can tear down the old one, or if there is a way to get kube to sync back up with the state of docker. – xamox Jun 06 '17 at 03:25
  • @johnharris85 Ahh, thanks, that worked: deleting them all manually. It took about 2 hours as there were 9292 errored-out pods. – xamox Jun 06 '17 at 12:25
  • https://github.com/kubernetes/kubernetes/issues/53331 – kivagant Mar 17 '18 at 03:10

5 Answers


kubectl delete pods --field-selector status.phase=Failed -n <your-namespace>

This cleans up any failed pods in <your-namespace>.
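
If the failed pods are spread across several namespaces, newer kubectl releases should also accept the same field selector together with --all-namespaces (worth verifying on your version):

kubectl delete pods --field-selector=status.phase=Failed --all-namespaces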

Kevin Pedersen

Here is a quick way to fix it :)

kubectl get pods | grep Error | cut -d' ' -f 1 | xargs kubectl delete pod

Edit: add the -a flag to kubectl get pods if you are using an older version of k8s.
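
A slightly more defensive variant (assuming GNU xargs for the -r flag) skips the header row, matches only the STATUS column, and scopes the delete to one namespace:

kubectl get pods -n <your-namespace> --no-headers | awk '$3 == "Error" {print $1}' | xargs -r kubectl delete pod -n <your-namespace>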

marcostvz

I had many pods that were stuck in the following statuses:

  • ContainerCannotRun
  • Error
  • ImagePullBackOff

These pods were in the above states for valid reasons, but they didn't get cleaned up automatically even after the underlying issues were resolved.

To clean them up manually, the following attempts didn't work:

# Doesn't work
kubectl get pods --field-selector status.phase=Error 

# Doesn't work
kubectl get pods \
    --field-selector=status.phase=Error

# Doesn't work
kubectl get pods \
    --field-selector="status.phase=Error"

# Doesn't work
kubectl get pods \
    --field-selector="status.phase==Error"


None of those work because Error is not a valid pod phase; the only pod phases are Pending, Running, Succeeded, Failed, and Unknown, and the Error shown by kubectl get pods comes from the container state. Filtering instead by the phases we want to retain works perfectly:

# Validate the list of pods first.
# Add more != clauses for any other phases you want to keep.
kubectl get pods \
    --field-selector="status.phase!=Succeeded,status.phase!=Running"

# Delete the pods that match the filter
kubectl delete pods \
    --field-selector="status.phase!=Succeeded,status.phase!=Running"

Sairam Krish

I usually remove all the Error pods with this command:

kubectl delete pod `kubectl get pods --namespace <yournamespace> | awk '$3 == "Error" {print $1}'` --namespace <yournamespace>

Ahmed Hosny

The solution was what @johnharris85 suggested in the comments: I had to manually delete all the pods. To do that I ran the following:

kubectl get pods -w | tee all-pods.txt

That dumped all my pods to a file, which I then filtered so I could delete only the ones I wanted:

kubectl delete pod $(more all-pods.txt | grep es-setup-index | awk '{print $1}')

Note: I had about 9292 pods; it took about 1-2 hours to delete them all.
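
If I had to do it again, batching the deletes and not waiting for each pod to finish terminating (the --wait flag is available in newer kubectl releases, I believe) would probably speed this up considerably:

more all-pods.txt | grep es-setup-index | awk '{print $1}' | xargs -n 50 kubectl delete pod --wait=false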

xamox