3

We're using an HTTP(S) load balancer in front of our GKE backends with NEGs. We recently created alerts in GCP Monitoring for 5xx load balancer errors based on the loadbalancing.googleapis.com/https/backend_request_count metric. Sometimes an alert fires for 500 errors even though we see no 500 errors on the application side (at least not within a ~10-minute window around the alert).

Could this be an internal networking issue with the load balancer itself? If not, what else could cause this? Maybe something internal to the GKE cluster? We checked the logs of the load balancer itself but didn't find any detail that helps resolve this.

Cristopher
  • 103
  • 4
  • 1
    You can take a look at [Cloud Debugger](https://cloud.google.com/debugger) which can help you to find the cause of 500 errors. – mario Apr 13 '21 at 22:14
  • Does this issue still persist? Which GKE version are you using? Could you provide more details about your environment/setup? Please share the full error output. – PjoterS Nov 17 '21 at 13:12

2 Answers

1

First, you should look in Google Cloud Logging for these failed requests, provided you have logging enabled for your GKE cluster. This will give you more details about them.

Using Cloud Logging on GKE
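
As a rough sketch of what looking for these requests could involve, the snippet below builds a Cloud Logging filter for load balancer request logs with a 5xx status. It assumes request logging is enabled on the backend service; the helper name lb_5xx_filter is made up for illustration, and PROJECT_ID is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gcloud): prints a Cloud Logging filter
# matching HTTP(S) load balancer request logs with a 5xx response status.
lb_5xx_filter() {
  printf '%s' 'resource.type="http_load_balancer" AND httpRequest.status>=500'
}

# Example usage (needs gcloud credentials, so it is left commented out):
# gcloud logging read "$(lb_5xx_filter)" \
#     --project=PROJECT_ID --freshness=1h --limit=20 \
#     --format='table(timestamp, httpRequest.status, jsonPayload.statusDetails)'
```

Where it is present, the jsonPayload.statusDetails field on these entries usually indicates whether the response was sent by the backend or generated by the load balancer, which may help narrow down where the 500 originates.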

Second, the recommended way is to instrument your application with Google's Cloud Trace and OpenTelemetry. This way you can create alerts, metrics, and dashboards, and can even check which request and code block generated the error.

It's not a quick and easy task, but it's something extremely valuable for debugging purposes.

Please take a look at Stackdriver Trace

surfingonthenet
  • 715
  • 3
  • 7
  • Thank you for your help but we're already using Stackdriver Logging and we log 500 errors on the application side. That's why we don't understand that at the time of the alerts there is no error log in the application's log. There might be some delay because of the aggregation of metrics but I don't think that should be more than a couple of minutes. And if the issue is not on this side there might be something outside of the application itself but I don't know where to move on. – Richard Szabo Apr 19 '21 at 08:58
1

One likely scenario, based on the fact that you see no application-specific errors, is that your health checks may be periodically failing. I would first start by checking to ensure that health checks for your backend(s) are configured properly (URI, timeout, etc.). If all seems well and you don't already have it turned on, enable health check logging:

gcloud compute health-checks update PROTOCOL HEALTH_CHECK_NAME \
    --enable-logging

...and investigate the logs to see if there is a pattern to the failures (e.g. specific node, time of day, etc.):

logName="projects/PROJECT_ID/logs/compute.googleapis.com%2Fhealthchecks"
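
To look for a pattern from the command line, one option is to narrow the filter above to unhealthy probe results and pass it to gcloud logging read. This is only a sketch: the helper name unhealthy_probe_filter is hypothetical, PROJECT_ID is a placeholder, and the jsonPayload.healthCheckProbeResult field names are assumed from the health check log format, so verify them against an actual log entry first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a filter for health check log entries in the
# given project, limited to probes that reported an UNHEALTHY state.
# The field path jsonPayload.healthCheckProbeResult.healthState is an
# assumption about the log schema; check it against a real entry.
unhealthy_probe_filter() {
  local project_id="$1"
  # %%2F prints a literal %2F (the URL-encoded slash in the log name).
  printf 'logName="projects/%s/logs/compute.googleapis.com%%2Fhealthchecks" AND jsonPayload.healthCheckProbeResult.healthState="UNHEALTHY"' "$project_id"
}

# Example usage (needs gcloud credentials, so it is left commented out):
# gcloud logging read "$(unhealthy_probe_filter PROJECT_ID)" \
#     --project=PROJECT_ID --freshness=1d --limit=50 \
#     --format='table(timestamp, jsonPayload.healthCheckProbeResult.ipAddress, jsonPayload.healthCheckProbeResult.probeResultText)'
```

Comparing the timestamps of these entries with the times the 5xx alert fired should show whether the two are correlated.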
Garrett
  • 1,332
  • 10
  • 16