
I have a GKE cluster with 2 nodes and a service of type LoadBalancer. When I call the service internally, a long request completes fine and does not time out, even after 120 seconds. But if I call the external IP of the Network Load Balancer that forwards to the internal service, I get an "Empty reply from server" response after exactly 120 seconds.

External call example:

curl -v "http://<public-ip>/longResponse"
*   Trying <public-ip>...
* TCP_NODELAY set
* Connected to <public-ip> (<public-ip>) port 80 (#0)
> GET /longResponse HTTP/1.1
> Host: <public-ip>
> User-Agent: curl/7.54.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host <public-ip> left intact
curl: (52) Empty reply from server

Internal call example:

/ # wget -O - -S <service-name>/longResponse
Connecting to location-service (10.3.255.181:80)
  HTTP/1.1 200 OK
  Access-Control-Allow-Origin: *
  Content-Type: application/json
  Content-Length: 15
  Date: Thu, 28 Feb 2019 10:31:14 GMT
  Connection: close

-                    100% |*********************************************************************************************************************************************************************************************************************|    15  0:00:00 ETA
/ # 

I've tried to find documentation for a request or socket timeout at the load balancer level, but I didn't find anything. Any ideas?

Thanks.

ilaif

2 Answers


Are you sure that's not a client-side timeout? Network LB doesn't process packets other than to route them, so it should never send any response back.

Try the -m flag to curl?
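
For example (the 300-second value here is arbitrary, just something well past the 120-second mark, to rule out curl's own limits):

curl -v -m 300 "http://<public-ip>/longResponse"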

Also, maybe capture a tcpdump on the client side so you can see what the network is actually doing.
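
Something along these lines should do it (the filter is only a sketch; adjust the interface and port to your setup):

sudo tcpdump -i any -nn -tttt host <public-ip> and tcp port 80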

Tim Hockin
  • Thanks! I tried using the -m option but it still returns an empty response after exactly 120 seconds. I also tried tcpdump; there's a [FIN, ACK] from the host after exactly 120 seconds. So, as suspected, something in the chain is timing out after 120 seconds, but it's not the K8S service or the application server, since I've checked those already. – ilaif Mar 01 '19 at 15:16
  • ...and you're SURE this doesn't happen when you access the service IP directly? There are some conntrack control knobs that might correspond, but they should apply equally to service IPs and load balancers. Look at `/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_*` on your node, maybe try changing some of those to unique values and measure the impact. Once we know if and which of these are in play, we can talk about a systemic solution. – Tim Hockin Mar 01 '19 at 20:45
  • Yes I'm sure! Looking at /proc is interesting: ..._close: 10, ..._close_wait: 3600, ..._established: 86400, ..._fin_wait: *120*, ..._last_ack: 30, ..._max_retrans: 300, ..._syn_recv: 60, ..._syn_sent: *120*, ..._time_wait: *120*, ..._unacknowledged: 300 So one of `syn_sent`, `time_wait`, `fin_wait` is causing this? I saw in TCPDUMP the FIN message, so `fin_wait` is a suspect. How do I go about changing this in GKE? – ilaif Mar 02 '19 at 14:44
  • You can manually log on to each node and set it for testing (a sketch of this follows below), or you can run a privileged daemonset (if you want to keep the results). I would try setting each of those 120's to a different number, just to be sure. But I cannot explain why this would affect the LB and not service VIPs, so I am still skeptical. – Tim Hockin Mar 03 '19 at 02:58
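
A rough sketch of that test, assuming shell access to the node (run it on the node directly or from a privileged DaemonSet; the replacement values below are arbitrary markers, chosen only so the culprit can be identified):

# list the current conntrack TCP timeouts
for f in /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_*; do echo "$f = $(cat "$f")"; done

# give each 120-second knob a distinct value for testing
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_fin_wait=121
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_syn_sent=122
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=123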

Get the load-balancer's backend name with:

gcloud compute backend-services list

then

BACKEND=name-of-your-backend
gcloud compute backend-services update $BACKEND --timeout=600s

Otherwise, in the console: Network services ⇒ Load balancing ⇒ Backends, then click your HTTP backend(s) and edit the settings, including the timeout.
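
To double-check that the change took effect, something like this should print the new value (a hypothetical check; add --global or --region to match how the backend service was created):

gcloud compute backend-services describe $BACKEND --format="value(timeoutSec)"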

On a wider note, this may be one of several hops between server and client, each of which might time out. You're better off either living with the timeout (and making your long polls complete before it) or drip-feeding data down the line... for instance, you can prepend whitespace to JSON, so you could send a space character every 30 seconds until you have a proper response body. This will keep the load balancer from timing out.
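
As a toy illustration of that drip-feed idea (this is just a netcat sketch, not your application server; nc flags vary between netcat variants, and the intervals and body are made up):

# send headers, then a space every 30 seconds as keep-alive padding, then the real JSON;
# leading whitespace in front of a JSON value is still valid JSON
{
  printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nConnection: close\r\n\r\n'
  for i in 1 2 3 4; do sleep 30; printf ' '; done
  printf '{"result":"done"}\n'
} | nc -l -p 8080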

spender
  • `gcloud compute backend-services list` returns 0 results, and this matches the `gcloud compute project-info describe --project PROJECT_NAME` result, which says usage: 0. However, I have k8s services up and running in the same project. I've created everything through the GKE console. Could that be the reason? – ilaif Mar 01 '19 at 14:19