I have several (<10) gke clusters, all but one are all in the same condition and I can't figure out what and why is it happening. I hope to find someone that managed to solve the same issue :)
Some time ago, i noticed that our HPA stopped working, having no way to read metrics from pods. Long story short, our pod named "metrics-server-v0.5.2-*" crashloops outputting a stack trace like this one: https://nopaste.net/aA5rwAeBWI and this one: https://nopaste.net/3uQsD6EDfA .
I tried to restart the deployment, without any success. What I don't understand is why one of the clusters (the oldest to be created) is working...
From r/googlecloud (reddit) it seems aknowledged that is not specifically tied to our workloads.
All the clusters are updated to the same version: v1.24.12-gke.500 and have been created by terraform resources.
Do any of you have any pointer?
Thanks.