
I made a simple setup with two GCE instances behind a load balancer. But in the balancer logs, I can see random 502 responses with the following error: "failed_to_connect_to_backend"

Although the last health check was fine with a 200 response, checking my nginx logs shows that the request didn't even get through to nginx on the backend.

I can't tell what the issue is. Are there any logs showing why it failed to connect to the backend? Is it a health check issue? Are there any health check logs?

Sari Alalem
  • What type of LB have you configured? If using HTTP(S), you may try increasing the [timeout value](https://cloud.google.com/compute/docs/load-balancing/http/backend-service#backend_service_components). It's also possible that the backend is overloaded and failing to serve traffic, so verify instance resource usage. If it's an HTTP(S) LB, verify the URL map is configured [correctly](https://cloud.google.com/compute/docs/load-balancing/http/url-map). Currently, GCP does not have an option to view health check requests to the instances. – N Singh Nov 06 '17 at 18:29
  • Thanks Navi. It's HTTPS, and I'm trying to isolate the issue; I thought having a log of HC requests could help immensely... What I'm trying now is pointing the health check at a static html file on a separate server for the sake of debugging (I'm using tomcat, and the HC goes to nginx), to see if things work fine when the HC is always healthy. – Sari Alalem Nov 06 '17 at 20:06
  • Note that with the HTTP(S) LB, the health check does not use URL map rules to check backends, so a healthy status does not mean the LB is configured correctly. Make sure [URL mapping](https://cloud.google.com/compute/docs/load-balancing/http/url-map) is configured correctly. You can also try increasing the [timeout value](https://cloud.google.com/compute/docs/load-balancing/http/backend-service#backend_service_components) on the LB; by default, the backend timeout is 30 seconds. In addition, monitor the instance resource usage (memory, CPU, etc.). – N Singh Nov 07 '17 at 14:48
  • Thanks Navi, you are right. It looks like an issue with the URL I'm using as a health check, because I've used a dummy html file URL for the health check for a while, and the issue doesn't happen. – Sari Alalem Nov 08 '17 at 08:05
  • So I guess I have to figure out what's going on in my system for that URL... now to close this question. – Sari Alalem Nov 08 '17 at 08:06
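
The debugging approach described in the comments, pointing the health check at a static file so it is always healthy, can be sketched with gcloud. This is illustrative only; the health check and backend service names are hypothetical, and the static file is assumed to be served by nginx on port 80:

```shell
# Create a health check that probes a static html file
# (resource names and paths are hypothetical)
gcloud compute health-checks create http static-hc \
    --request-path=/health.html \
    --port=80 \
    --check-interval=10s \
    --timeout=5s

# Attach it to the backend service behind the load balancer
gcloud compute backend-services update my-backend-service \
    --health-checks=static-hc \
    --global
```

If the 502s stop with the static-file check, the problem is likely in how the application handles the original health check URL rather than in the LB itself.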

1 Answer


Have you configured the keepalive timeout correctly?

> A TCP session timeout, whose value is fixed at 10 minutes (600 seconds). This session timeout is sometimes called a keepalive or idle timeout, and its value is not configurable by modifying your backend service. You must configure the web server software used by your backends so that its keepalive timeout is longer than 600 seconds to prevent connections from being closed prematurely by the backend.

This is now in the official GCP documentation. Recommended setting for nginx: `keepalive_timeout 620s;`. Recommended setting for Apache: `KeepAliveTimeout 620`.
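
A minimal sketch of the nginx side, assuming a standard package layout where files under `conf.d/` are included in the `http` context (the file path and server name are illustrative):

```nginx
# /etc/nginx/conf.d/keepalive.conf (illustrative path)
server {
    listen 80;
    server_name example.com;

    # Keep idle connections open longer than the LB's fixed
    # 600-second session timeout, so the backend never closes
    # a connection the load balancer still considers usable.
    keepalive_timeout 620s;

    location / {
        root /var/www/html;
    }
}
```

After editing, validate and reload with `nginx -t && nginx -s reload`. For Apache, the equivalent is `KeepAliveTimeout 620` in the main server configuration.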

https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340

https://cloud.google.com/compute/docs/load-balancing/http/

https://cloud.google.com/load-balancing/docs/https/#timeouts_and_retries

  • I wonder how this works when I'm not explicitly using Nginx, though? I'm seeing this when I use GKE without any Nginx in the mix. How does the GKE ingress in Kubernetes work with this? – Randy L Mar 29 '18 at 19:47