
I have upgraded an EKS cluster to 1.23, and, apparently, since then all the GitLab runner pods (Kubernetes executor) no longer really terminate; they stay in "Terminating" state seemingly forever. If I explicitly delete them using kubectl, they do go away.
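
For reference, "explicitly delete" means something along these lines (namespace taken from the runner setup; <pod-name> is a placeholder):

# a plain delete
kubectl delete pod <pod-name> -n gitlab

# force variant, in case a plain delete hangs as well
kubectl delete pod <pod-name> -n gitlab --grace-period=0 --force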

Some observations & facts:

  • the pods don't have volumes attached
  • k9s shows the containers within the pod in "ERROR" state, but gives no indication as to why (see the kubectl sketch after this list).
  • there is no finalizer attached (the one I do see on the terminating pods, foregroundDeletion, just indicates that the pod is currently being deleted).
  • GitLab shows no errors, neither in the pod logs nor in the UI; the CI jobs complete successfully
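
For reference, the container statuses and events can be pulled directly with kubectl, beyond what k9s surfaces (<pod-name> is a placeholder for one of the stuck runner pods):

# container states that k9s summarizes as "ERROR"
kubectl get pod <pod-name> -n gitlab -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# events sometimes name whatever is blocking the termination
kubectl describe pod <pod-name> -n gitlab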

A lot of discussions hint at finalizers, but again: there are none attached. Also, as I understand it, if finalizers were the problem, only removing them would make the pods finally terminate and go, and that is not the behavior I see. I also found this SO question, but it has not helped me diagnose or solve the issue.
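
For completeness, this is how finalizers can be inspected and, if there were any custom ones, stripped (again with <pod-name> as a placeholder; the patch is listed only for reference, since there is nothing custom to strip here):

# only shows foregroundDeletion while the pod is being deleted, nothing else
kubectl get pod <pod-name> -n gitlab -o jsonpath='{.metadata.finalizers}'

# what removing finalizers would look like (not applicable in this case)
kubectl patch pod <pod-name> -n gitlab --type merge -p '{"metadata":{"finalizers":null}}'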

I don't really know where to keep looking. Apart from the GitLab runners everything seems to run just fine, and the runners' configuration did not change.

Any help here is greatly appreciated.

UPDATE: I now realized that the pod is apparently gone on the underlying machine; at least there is no trace of it in docker ps. Also, two hosts rotated last night, and a bunch of pods are still showing as running on those hosts (which no longer exist).
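
The mismatch is easy to see by comparing the node column of the stuck pods with the nodes the cluster still knows about:

# stuck pods still reference nodes that have been rotated away
kubectl get pods -n gitlab -o wide

# ...while those instances no longer appear here
kubectl get nodes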

Even more puzzled.

UPDATE II: I managed to grab a /var/log/messages excerpt from the underlying host for a terminated container, see below.

[screenshot from k9s: a lot of pods in "Terminating" state]

[screenshot from k9s: the containers within the terminating pod are in "ERROR" state]

# grep ... /var/log/messages
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.823617    3693 kubelet.go:2120] "SyncLoop ADD" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946041    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946155    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946217    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946241    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946279    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946307    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047362    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047460    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047533    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047570    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047643    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047699    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047815    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048095    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048277    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.050600    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.061048    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.062031    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.159234    3693 kuberuntime_manager.go:487] "No sandbox for pod can be found. Need to start a new one" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.835414    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873037    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873611    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:98f1518e3d2f4ba7d66e1fd5238e49e928f81c5fe044e7bcbf408ec98ff7e45c}
Mar  8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926561    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar  8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926604    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.802812    3693 kubelet.go:2136] "SyncLoop DELETE" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.804708    3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="build" containerID="docker://7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d" gracePeriod=1
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.805281    3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="helper" containerID="docker://c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc" gracePeriod=1
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705701    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705852    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705876    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:37:41 ip-10-0-136-6 kubelet: I0308 09:37:41.984526    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7co
flypenguin

1 Answer


I ran into the exact same issue, with the same symptoms, on EKS 1.24 after upgrading Calico's tigera-operator to v3.26.0 (installed alongside the aws-cni, as recommended by Amazon).

I was able to identify the tigera-operator upgrade as the culprit by looking at the cluster's kube-state-metrics, via the following query:

kube_pod_status_phase{phase="Failed", namespace="gitlab-runner"}

There was a large uptick in pods in the Failed phase after the upgrade. Assuming you have network policies enabled in your cluster via Calico, this is likely the reason.
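
To check whether you are in the same situation, you can look at which operator version is actually deployed and whether Calico has any network policies to enforce (the namespace and Deployment name assume the default tigera-operator install):

# image (and thus version) of the running tigera-operator
kubectl get deployment tigera-operator -n tigera-operator -o jsonpath='{.spec.template.spec.containers[0].image}'

# NetworkPolicy objects Calico would be enforcing
kubectl get networkpolicy -A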

Downgrading tigera-operator to v3.25.1 completely resolves the issue. Make sure to wait for the calico-node DaemonSet to be completely up to date, as without manual intervention its pods are restarted one by one.
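
If the operator was installed from the upstream manifest, the downgrade and the wait look roughly like this (adjust accordingly if you installed via the Helm chart; calico-system is the namespace the operator uses by default):

# roll the operator back to the Calico v3.25.1 manifest
# (server-side apply avoids the annotation size limit of the large CRDs in this file)
kubectl apply --server-side -f https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml

# wait until every calico-node pod has actually been replaced
kubectl rollout status daemonset/calico-node -n calico-system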

There is a relevant GitLab issue where this problem may be analyzed further. Before identifying calico/tigera-operator as the culprit, I tried all minor (and some patch) versions of the gitlab-runner chart from 0.52.1 back to 0.47.0 and even reviewed the changes, but the chart was definitely not the reason for the behavior. This appears to be an upstream Calico bug that is triggered by the way the GitLab runner pods are terminated.

vaskozl