I have a Go service that implements a WebSocket client using gorilla/websocket, exposed from a Google Container Engine (GKE)/k8s cluster via a NodePort (30002 in this case).

I've got a manually created load balancer (i.e. NOT a k8s ingress/load balancer) with HTTP/HTTPS frontends (i.e. 80/443) that forward traffic to nodes in my GKE/k8s cluster on port 30002.

I can get my JavaScript WebSocket implementation in the browser (Chrome 58.0.3029.110 on OSX) to connect, upgrade and send / receive messages.

I log ping/pongs in the Go WebSocket client and all looks good until 30s after connection, when my Go WebSocket client gets an EOF / close 1006 (abnormal closure) and my JavaScript code gets a close event. As far as I can tell, neither my Go nor my JavaScript code is initiating the WebSocket closure.

I don't particularly care about session affinity in this case AFAIK, but I have tried both IP and cookie based affinity in the load balancer with long lived cookies.

Additionally, this exact same set of k8s deployment/pod/service specs and Go service code works great on my kops-based k8s cluster on AWS through AWS ELBs.

Any ideas where the 30s forced closures might be coming from? Could that be a k8s default cluster setting specific to GKE or something on the GCE load balancer?

Thanks for reading!

-- UPDATE --

There is a backend configuration timeout setting on the load balancer which is for "How long to wait for the backend service to respond before considering it a failed request".

The WebSocket is not unresponsive. It is sending ping/pong and other messages right up until getting killed which I can verify by console.log's in the browser and logs in the golang service.

That said, if I bump the load balancer backend timeout setting to 30000 seconds, things "work".
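For reference, the same bump can be made from the CLI; a sketch, with `my-backend` as a placeholder for the backend service's actual name:

```shell
# Raise the backend service timeout (in seconds). On the HTTP(S) load
# balancer this effectively caps how long a WebSocket connection may live.
gcloud compute backend-services update my-backend \
    --global \
    --timeout=86400
```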

This doesn't feel like a real fix, though, because the load balancer will continue to feed traffic to genuinely unresponsive services, never mind what happens if the WebSocket itself does become unresponsive.

I've isolated the high timeout setting to a specific backend setting using a path map, but hoping to come up with a real fix to the problem.

Jeremy Gordon
  • Could it be the load balancer that kills the connection after 30 seconds? – Oliver May 18 '17 at 02:21
  • what did you use for your custom ingress controller? – RickyA May 18 '17 at 13:42
  • 1
    I don't know how I missed it, but the backend settings of the GCP load balancer has a timeout for "How long to wait for the backend service to respond before considering it a failed request" defaulting to 30s. The WebSocket is not unresponsive. It's sending/receiving ping/pongs and msgs right up until getting killed at 30s, which I see in browser and golang logs. If I set the load balancer timeout to 30000 seconds, things "work". Not a real fix IMO as other traffic through the load balancer will never become unresponsive; the load balancer will continue to send them traffic. – Jeremy Gordon May 19 '17 at 05:12
  • 2
    We have the exact same issue and can confirm it is a problem with the loadbalancer and not Kubernetes/GKE. We have switched from TCP to HTTPS loadbalancer and have just noticed that socket connections were timing out at 30s. Tried changing the above mentioned setting but are still seeing connections dropped – jmartins May 25 '17 at 10:11

2 Answers

I think this may be Working as Intended. Google just updated the documentation today (about an hour ago).

LB Proxy Support docs

Backend Service Components docs

Cheers,

Matt

brugz

Check out the following example: https://github.com/kubernetes/ingress-gce/tree/master/examples/websocket

Cyral
ahmet alp balkan
  • Thanks for the link to the github discussion! As I mention in my post, I'm not using Kubernetes to load balance, so unless I'm missing something, I don't care about the affinity afforded by OnlyLocal but yes, as per my update, the timeout on the GCP load balancer was the key. That said, it's not a "real" solution because it means that it won't time out *actually* unresponsive connections. – Jeremy Gordon Jun 07 '17 at 06:21
  • Updated my response with an example to be more accurate. – ahmet alp balkan Jun 09 '17 at 20:27
  • @Chris I've updated the link (https://github.com/kubernetes/ingress-gce/tree/master/examples/websocket) – Cyral Dec 27 '17 at 22:19
  • 1
    Dead links again :'( – PoorBob Mar 06 '19 at 18:05
  • Example was removed but here is from a reference commit: https://github.com/kubernetes/ingress-gce/tree/b0603c69382a39d097063eab8e3d9e30ca1cdf7b/examples/websocket Also the original PR: https://github.com/kubernetes/ingress-nginx/pull/834/files – tudorprodan Jun 03 '19 at 09:15