I have a Cloud Run for Anthos service running on my Kubernetes cluster, which is managed in Google Kubernetes Engine.
Out of nowhere, all of the deployed services stopped responding. The cause is that the queue-proxy container in the services' pods started crashing in a loop.
I am not familiar with Knative, and I could not find anything similar on the internet related to GKE and this specific container.
The logs of the queue-proxy container aren't much help to me either:
{"level":"info","ts":1626091452.0552812,"logger":"fallback-logger","caller":"logging/config.go:78","msg":"Fetch GitHub commit ID from kodata failed","error":"open /var/run/ko/HEAD: no such file or directory"}
{"level":"info","ts":1626091452.055691,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:347","msg":"Queue container is starting with queue.BreakerParams{QueueDepth:800, MaxConcurrency:80, InitialCapacity:80}","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091452.0628664,"logger":"fallback-logger.queueproxy","caller":"metrics/exporter.go:160","msg":"Flushing the existing exporter before setting up the new exporter.","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091452.0656931,"logger":"fallback-logger.queueproxy","caller":"metrics/stackdriver_exporter.go:203","msg":"Created Opencensus Stackdriver exporter with config &{knative.dev/internal/serving revision stackdriver 60000000000 0x163f700 <nil> false 0 true knative.dev/internal/serving/revision custom.googleapis.com/knative.dev/revision { false}}","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091452.065773,"logger":"fallback-logger.queueproxy","caller":"metrics/exporter.go:173","msg":"Successfully updated the metrics exporter; old config: <nil>; new config &{knative.dev/internal/serving revision stackdriver 60000000000 0x163f700 <nil> false 0 true knative.dev/internal/serving/revision custom.googleapis.com/knative.dev/revision { false}}","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091453.9316235,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:234","msg":"Received TERM signal, attempting to gracefully shutdown servers.","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091453.9317243,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:236","msg":"Sleeping 45s to allow K8s propagation of non-ready state","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091498.93187,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:241","msg":"Shutting down main server","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091499.4330895,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:250","msg":"Shutting down server: admin","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091499.9333868,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:250","msg":"Shutting down server: metrics","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091500.433703,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:255","msg":"Shutdown complete, exiting...","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
The container restarts a few times before its state changes to CrashLoopBackOff, which affects the pod's readiness and makes the service unavailable. GKE also creates a user-container container for my actual application, and that one runs properly the whole time. In the time between the restarts, before CrashLoopBackOff kicks in, the service is reachable and works properly.
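In case it helps, this is roughly how I'm observing the restarts and the readiness flapping (again, pod name and namespace from the logs above):

kubectl get pods -n default
kubectl describe pod test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq -n default

kubectl get pods shows the RESTARTS count climbing while READY shows something like 1/2, and kubectl describe pod shows events along the lines of "Back-off restarting failed container" for queue-proxy.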
The cluster configuration did not change. I tried upgrading the version of the nodes, but the issue remained.
I'm starting to think that this container is misleading me and the real cause is somewhere else, but I have no clue where to look, since I haven't changed anything myself.
Do you have any suggestions on how to tackle this issue?
EDIT: The cluster was running on version 1.18.17-gke.1901. I then tried 1.19.9-gke.1700 and 1.20.7-gke.2200, but the issue remained.
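For the record, I upgraded the node pool with a command along these lines (CLUSTER_NAME, NODE_POOL_NAME, and ZONE are placeholders for my actual values):

gcloud container clusters upgrade CLUSTER_NAME --node-pool=NODE_POOL_NAME --cluster-version=1.20.7-gke.2200 --zone=ZONE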
EDIT2: I just came across this in the changelog: Version 1.18.18-gke.1700 is no longer available in the Stable channel.
Could it be that my cluster was upgraded automatically because it was running this version?
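If that is what happened, I assume it would show up in the cluster's operation history, so I'm going to check with something like this (ZONE is a placeholder) and look for UPGRADE_MASTER or UPGRADE_NODES entries around the time the issue started:

gcloud container operations list --zone=ZONE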