
I have a Cloud Run for Anthos service running on my Kubernetes cluster, which is managed by Google Kubernetes Engine (GKE).

Out of nowhere, all the deployed services stopped responding. The cause is that the queue-proxy container in the services' pods started crash looping.
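
For context, this is roughly how I'm inspecting the crashing container (the namespace and pod name are the ones that appear in the logs below):

# List the revision's pods and check the RESTARTS column
kubectl get pods -n default

# Show container statuses, last state, and recent events for the affected pod
kubectl describe pod test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq -n default

# Logs from the previous (crashed) run of the queue-proxy container
kubectl logs test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq -c queue-proxy --previous -n default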

I am not familiar with Knative, and I could not find anything similar on the internet related to GKE and this specific container.

The logs of the queue-proxy container are not very helpful to me, as I'm not familiar with Knative:


{"level":"info","ts":1626091452.0552812,"logger":"fallback-logger","caller":"logging/config.go:78","msg":"Fetch GitHub commit ID from kodata failed","error":"open /var/run/ko/HEAD: no such file or directory"}
{"level":"info","ts":1626091452.055691,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:347","msg":"Queue container is starting with queue.BreakerParams{QueueDepth:800, MaxConcurrency:80, InitialCapacity:80}","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091452.0628664,"logger":"fallback-logger.queueproxy","caller":"metrics/exporter.go:160","msg":"Flushing the existing exporter before setting up the new exporter.","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091452.0656931,"logger":"fallback-logger.queueproxy","caller":"metrics/stackdriver_exporter.go:203","msg":"Created Opencensus Stackdriver exporter with config &{knative.dev/internal/serving revision stackdriver 60000000000 0x163f700 <nil>  false 0  true knative.dev/internal/serving/revision custom.googleapis.com/knative.dev/revision {   false}}","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091452.065773,"logger":"fallback-logger.queueproxy","caller":"metrics/exporter.go:173","msg":"Successfully updated the metrics exporter; old config: <nil>; new config &{knative.dev/internal/serving revision stackdriver 60000000000 0x163f700 <nil>  false 0  true knative.dev/internal/serving/revision custom.googleapis.com/knative.dev/revision {   false}}","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091453.9316235,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:234","msg":"Received TERM signal, attempting to gracefully shutdown servers.","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091453.9317243,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:236","msg":"Sleeping 45s to allow K8s propagation of non-ready state","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091498.93187,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:241","msg":"Shutting down main server","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091499.4330895,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:250","msg":"Shutting down server: admin","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091499.9333868,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:250","msg":"Shutting down server: metrics","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}
{"level":"info","ts":1626091500.433703,"logger":"fallback-logger.queueproxy","caller":"queue/main.go:255","msg":"Shutdown complete, exiting...","knative.dev/key":"default/test-nginx-00002-kuw","knative.dev/pod":"test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq"}

The container restarts a few times before its state changes to CrashLoopBackOff, which affects the pod's readiness and makes it unavailable. GKE creates another container for my actual application, user-container, which runs properly the whole time. In the time between the restarts and the CrashLoopBackOff, the service is reachable and works properly.
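
What strikes me in the logs above is that queue-proxy receives a TERM signal and shuts down gracefully, so something appears to be killing it rather than it crashing on its own. This is a sketch of how I'm looking for the reason (probe failures, kills, evictions), assuming the default namespace:

# Recent events in the namespace, sorted by time (look for probe failures, kills, evictions)
kubectl get events -n default --sort-by=.lastTimestamp

# The full pod spec, including the readiness/liveness probes configured on queue-proxy
kubectl get pod test-nginx-00002-kuw-deployment-5b94b4c464-8pwdq -n default -o yaml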

The cluster configuration did not change. I tried upgrading the node version, but the issue remained.
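
For reference, the node upgrade was done along these lines (the cluster, node pool, and zone names are placeholders):

# Upgrade the node pool to match the control plane's version
gcloud container clusters upgrade my-cluster \
  --node-pool=default-pool \
  --zone=europe-west1-b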

I'm starting to think that this container is misleading me and that the real cause lies somewhere else, but I have no clue where to look, as I haven't changed anything myself.

Do you have any suggestions on how to tackle this issue?

EDIT: The cluster was running version 1.18.17-gke.1901. I then tried 1.19.9-gke.1700 and 1.20.7-gke.2200, but the issue remained.
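
For reference, this is roughly how I checked which versions are available in the release channels and upgraded the control plane (cluster and zone names are placeholders):

# List the GKE versions currently available per release channel in this zone
gcloud container get-server-config --zone=europe-west1-b

# Upgrade the control plane to a specific version
gcloud container clusters upgrade my-cluster --master \
  --cluster-version=1.20.7-gke.2200 \
  --zone=europe-west1-b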

EDIT2: I just came across this in the changelog: "Version 1.18.18-gke.1700 is no longer available in the Stable channel." Could it be that, because my cluster was running this version, it was upgraded automatically?
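
One way to check whether GKE performed an automatic upgrade is to list the recent cluster operations and look for UPGRADE_MASTER / UPGRADE_NODES entries (the zone is a placeholder):

# Recent operations on clusters in this zone; auto-upgrades show up as UPGRADE_MASTER / UPGRADE_NODES
gcloud container operations list --zone=europe-west1-b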

  • Very likely depends on the cluster master version/track you are on. Maybe there was a bug and it's fixed. I recommend upgrading to a newer version if you're up for it. Also try adding your Knative version to the question. – ahmet alp balkan Jul 12 '21 at 22:58
  • That was my first thought, and I switched to different versions, but the problem remained. Not even a simple hello-world test service would work; I still can't figure out what caused the issue. I also noticed that the services were redeployed on their own the day before, but they were running fine, so I didn't worry about it. How does Google manage the templates and updates of clusters pinned to a fixed version? Is Cloud Run the actual problem here? – Ganitzsh Jul 14 '21 at 09:03
  • 1
    I suggest a workaround - Try to disable and re-enable Cloud Run for Anthos.This [documentation](https://cloud.google.com/anthos/run/docs/setup) has steps for disable and re-enable Cloud Run for Anthos – Goli Nikitha Jul 28 '21 at 12:34
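
The workaround suggested above, sketched with gcloud, assuming Cloud Run for Anthos is managed as the cluster's CloudRun add-on (cluster and zone names are placeholders):

# Disable the Cloud Run for Anthos add-on
gcloud container clusters update my-cluster --zone=europe-west1-b --update-addons=CloudRun=DISABLED

# Re-enable the add-on
gcloud container clusters update my-cluster --zone=europe-west1-b --update-addons=CloudRun=ENABLED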
