
We have an HTTP(S) load balancer created by a Kubernetes Ingress, which points to a backend formed by a set of pods running nginx and Ruby on Rails.
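
For context, a setup like this is typically wired up through a GKE Ingress of roughly the following shape (a minimal sketch with hypothetical names and the 2018-era API version, not the actual manifest used here):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: web-ingress                      # hypothetical name
  annotations:
    kubernetes.io/ingress.class: "gce"   # the GCE ingress controller provisions the HTTP(S) load balancer
spec:
  backend:
    serviceName: web                     # hypothetical Service fronting the nginx/Rails pods
    servicePort: 80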

Looking at the load balancer logs, we have detected an increasing number of requests with a response code of 0 and statusDetails = client_disconnected_before_any_response.

We're trying to understand why this is happening, but we haven't found anything relevant. There is nothing in the nginx access or error logs.

This is happening for multiple kinds of requests, both GET and POST.

We also suspect that sometimes, despite the request being logged with that error, the request is actually passed to the backend. For instance, we're seeing PG::UniqueViolation errors due to identical sign-up requests being sent twice to the backend in our sign-up endpoint.

Any kind of help would be appreciated. Thanks!


UPDATE 1

As requested, here is the YAML file for the Ingress resource:


UPDATE 2

I've created a log-based Stackdriver metric to count the number of requests that present this behavior. Here is the chart:

[chart: count of requests logged with client_disconnected_before_any_response over time]
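
For reference, a log filter along the following lines can back such a metric (a sketch assuming the standard HTTP(S) load balancer request-log fields; adjust to your project):

resource.type="http_load_balancer"
jsonPayload.statusDetails="client_disconnected_before_any_response"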

The big peaks approximately match the timestamps of these Kubernetes events:

[screenshot: Kubernetes events showing failed readiness probes]

Full error: Readiness probe failed: Get http://10.48.1.28:80/health_check: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

So it seems the readiness probe for the pods behind this backend sometimes fails, but not always.

Here is the definition of the readinessProbe:

readinessProbe:
  failureThreshold: 3
  httpGet:
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: [redacted]
    path: /health_check
    port: 80
    scheme: HTTP
  initialDelaySeconds: 1
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 5
jävi
  • Could you add the yaml for your load balancer to your question? – mstorkson Sep 10 '18 at 17:11
  • Seems like your backend cannot talk back to your client. Any outgoing traffic rules on your cluster? – Rico Sep 10 '18 at 17:15
  • @mstorkson added the yaml for the ingress – jävi Sep 10 '18 at 17:17
  • @rico note this is not happening for all the requests, only for a few of them (I'd say less than 1% or 5%), so I don't think it is a problem with outgoing traffic firewall rules. – jävi Sep 10 '18 at 17:17
  • Do these events become more frequent with heavier traffic? How many pods do you have serviced by this load balancer? How is your node for free memory/cpu? – mstorkson Sep 10 '18 at 17:20
  • Maybe your backend is dropping replies. – Rico Sep 10 '18 at 17:22
  • Could this simply be people using weird HTTP clients or crawler bots that occasionally visit your website? Do you measure any user-visible impact to your website? – ahmet alp balkan Sep 10 '18 at 23:46
  • @mstorkson it's hard to tell whether it's related to heavier traffic. We have 6 pods for this backend distributed across two nodes. Right now CPU is at <10% for both nodes and memory around 80%. – jävi Sep 11 '18 at 07:46
  • @rico how could I tell whether the backend is dropping replies? I can't see anything in the nginx logs. – jävi Sep 11 '18 at 07:47
  • @AhmetAlpBalkan-Google nope, it's happening for valid requests from normal users. The main user-visible impact we have found is when it happens for the sign-up endpoint. They fill in the sign-up form, and then somehow the first request gets processed but discarded by the load balancer, while the second request triggers a PostgreSQL error due to the email already being registered. – jävi Sep 11 '18 at 07:48

1 Answer


A response code of 0 with statusDetails = client_disconnected_before_any_response means the client closed the connection before the load balancer was able to provide a response, as per this GCP documentation.

Investigating why it did not respond in time, one possible cause is a mismatch between the keepalive timeouts of nginx and the GCP load balancer, even though that mismatch will most likely produce a backend_connection_closed_before_data_sent_to_client error instead, caused by a 502 Bad Gateway race condition.
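
If that mismatch is suspected, a minimal sketch of the commonly recommended nginx tuning looks like this (illustrative values; the GCP HTTP(S) load balancer keeps backend connections alive for up to 600 seconds, so nginx's keepalive should be longer than that):

# keep nginx connections open longer than the load balancer's ~600 s backend keepalive
keepalive_timeout 650;
keepalive_requests 10000;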

To make sure the backend responds to the request, and to see how long it takes, you can repeat this process a few times (since you still get some valid responses):

curl response time

$ curl -w "@curl.txt" -o /dev/null -s IP_HERE

curl.txt content (create and save this file first):

   time_namelookup:  %{time_namelookup}\n
      time_connect:  %{time_connect}\n
   time_appconnect:  %{time_appconnect}\n
  time_pretransfer:  %{time_pretransfer}\n
     time_redirect:  %{time_redirect}\n
time_starttransfer:  %{time_starttransfer}\n
                ----------\n
        time_total:  %{time_total}\n

If this is the case, please review the sign-up endpoint code for any kind of loop or retry that could produce the PG::UniqueViolation errors you mentioned.
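
Beyond that review, one defensive option, not part of the original answer (a purely illustrative Rails-style sketch with hypothetical controller, model, and parameter names, assuming a unique index on the email column), is to treat the unique-index violation as an already-processed sign-up instead of letting it bubble up as an error:

# app/controllers/signups_controller.rb (hypothetical)
class SignupsController < ApplicationController
  def create
    user = User.create!(signup_params)    # raises on a violated unique index
    render json: user, status: :created
  rescue ActiveRecord::RecordNotUnique    # wraps PG::UniqueViolation
    # The same sign-up was submitted twice; the first attempt already created the user.
    head :conflict
  end

  private

  def signup_params
    params.require(:user).permit(:email, :password)
  end
end

That way a duplicated submission that slips past the load balancer returns a 409 rather than a server error.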

Daniel
  • Thank you for your answer @Daniel. I was actually familiar with the [article you mentioned](https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340), and I actually have `keepalive_timeout 650;` and `keepalive_requests 10000;` in my nginx config. I also tried increasing the timeout for the GCP Backend from the default 30s to 60s, but the problem remains. In other words, I doubt it's a timeout problem :( – jävi Sep 12 '18 at 08:34
  • @jävi Thank you for clearing up the timeout possibility. Checking your updated question, I want to know how important the `httpHeaders:` are to your application. – Daniel Sep 12 '18 at 20:45
  • So [`X-Forwarded-Proto`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-Proto) is there because our app uses HTTPS, but SSL is terminated at the load balancer level. With this header we tell our app (nginx and later Ruby on Rails) that the request was originally HTTPS. The `Host` header is there to make sure nginx replies to that request, as it's currently limited to responding to requests from our production domain. – jävi Sep 13 '18 at 09:02
  • Readiness probes need to be as close as possible to the app's mechanics to accurately reflect whether it is able to answer requests. But in case the app gets really stuck, it is the liveness probe that will restart your pod (see the sketch after these comments). Based on the events list, I would expect to see some liveness checks failing (unless filtered). I hope [this article](https://cloudplatform.googleblog.com/2018/05/Kubernetes-best-practices-Setting-up-health-checks-with-readiness-and-liveness-probes.html), which helped me a lot to understand how to coordinate these two probes, sheds some light on the investigation. – Daniel Sep 15 '18 at 03:14
  • "means that the client closed the connection" -- note that this includes aborting requests. For example, if you call [`XHR.abort()`](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/abort) or if the browser automatically aborts downloading a static asset because of a navigation. – ZachB Aug 14 '20 at 01:38