
I have a GKE cluster with 2 nodes and a service of type LoadBalancer. When I call the service internally, a long request completes fine and does not time out, even after 120 seconds. But if I call the external IP of the Network Load Balancer that forwards to the internal service, I get an "Empty reply from server" response after exactly 120 seconds.

External call example:

curl -v "http://<public-ip>/longResponse"
*   Trying <public-ip>...
* TCP_NODELAY set
* Connected to <public-ip> (<public-ip>) port 80 (#0)
> GET /longResponse HTTP/1.1
> Host: <public-ip>
> User-Agent: curl/7.54.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host <public-ip> left intact
curl: (52) Empty reply from server

Internal call example:

/ # wget -O - -S <service-name>/longResponse
Connecting to location-service (10.3.255.181:80)
  HTTP/1.1 200 OK
  Access-Control-Allow-Origin: *
  Content-Type: application/json
  Content-Length: 15
  Date: Thu, 28 Feb 2019 10:31:14 GMT
  Connection: close

-                    100% |*********************************************************************************************************************************************************************************************************************|    15  0:00:00 ETA
/ # 

I've tried to find documentation for a request or socket timeout at the load balancer level, but I didn't find anything. Any ideas?

Thanks.

ilaif

2 Answers


Are you sure that's not a client-side timeout? Network LB doesn't process packets other than to route them, so it should never send any response back.

Try the -m flag to curl?
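
For example (the 300-second value here is arbitrary, just something well past the 120-second mark, to rule out curl's own limits):

curl -v -m 300 "http://<public-ip>/longResponse"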

Also, maybe capture a tcpdump on the client side so you can see what the network is actually doing.
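
Something along these lines should do it (the filter is only a sketch; adjust the interface and port to your setup):

sudo tcpdump -i any -nn -tttt host <public-ip> and tcp port 80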

Tim Hockin
  • Thanks! I tried using the -m option but it still returns an empty response after exactly 120 seconds. I also tried tcpdump; there's a [FIN, ACK] from the host after exactly 120 seconds. So, as suspected, something in the chain is timing out after 120 seconds, but it's not the K8S service or the application server, since I've checked those already. – ilaif Mar 01 '19 at 15:16
  • ...and you're SURE this doesn't happen when you access the service IP directly? There are some conntrack control knobs that might correspond, but they should apply equally to service IPs and load balancers. Look at `/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_*` on your node, maybe try changing some of those to unique values and measure the impact. Once we know if and which of these are in play, we can talk about a systemic solution. – Tim Hockin Mar 01 '19 at 20:45
  • Yes I'm sure! Looking at /proc is interesting: ..._close: 10, ..._close_wait: 3600, ..._established: 86400, ..._fin_wait: *120*, ..._last_ack: 30, ..._max_retrans: 300, ..._syn_recv: 60, ..._syn_sent: *120*, ..._time_wait: *120*, ..._unacknowledged: 300 So one of `syn_sent`, `time_wait`, `fin_wait` is causing this? I saw in TCPDUMP the FIN message, so `fin_wait` is a suspect. How do I go about changing this in GKE? – ilaif Mar 02 '19 at 14:44
  • You can manually log on to each node and set it for testing (a sketch of this follows below), or you can run a privileged daemonset (if you want to keep the results). I would try setting each of those 120's to a different number, just to be sure. But I cannot explain why this would affect the LB and not service VIPs, so I am still skeptical. – Tim Hockin Mar 03 '19 at 02:58
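
A rough sketch of that test, assuming shell access to the node (run it on the node directly or from a privileged DaemonSet; the replacement values below are arbitrary markers, chosen only so the culprit can be identified):

# list the current conntrack TCP timeouts
for f in /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_*; do echo "$f = $(cat "$f")"; done

# give each 120-second knob a distinct value for testing
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_fin_wait=121
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_syn_sent=122
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=123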

Get the load-balancer's backend name with:

gcloud compute backend-services list

then

BACKEND=name-of-your-backend
gcloud compute backend-services update $BACKEND --timeout=600s

Otherwise, in the console: Network services ⇒ Load balancing ⇒ Backends, then click your HTTP backend(s) and edit the settings, including the timeout.
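
To double-check that the change took effect, something like this should print the new value (a hypothetical check; add --global or --region to match how the backend service was created):

gcloud compute backend-services describe $BACKEND --format="value(timeoutSec)"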

On a wider note, this may be one of several hops between server and client, each of which might time out. You're better off either living with the timeout (and making your long polls complete before it) or drip-feeding data down the line... for instance, you can prepend whitespace to JSON, so you could send a space character every 30 seconds until you have a proper response body. This will keep the load balancer from timing out.
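
As a toy illustration of that drip-feed idea (this is just a netcat sketch, not your application server; nc flags vary between netcat variants, and the intervals and body are made up):

# send headers, then a space every 30 seconds as keep-alive padding, then the real JSON;
# leading whitespace in front of a JSON value is still valid JSON
{
  printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nConnection: close\r\n\r\n'
  for i in 1 2 3 4; do sleep 30; printf ' '; done
  printf '{"result":"done"}\n'
} | nc -l -p 8080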

spender
  • `gcloud compute backend-services list` returns 0 results, and this matches the `gcloud compute project-info describe --project PROJECT_NAME` result, which says usage: 0. However, I have k8s services up and running in the same project. I've created everything through the GKE console. Could that be the reason? – ilaif Mar 01 '19 at 14:19