gRPC keepalive not working on Alibaba ACK

Question

I've built some microservices on Kubernetes, and the pods communicate using gRPC. I've set keepalive settings on both the client and the server (copied and pasted from here: https://cs.mcgill.ca/~mxia3/2019/02/23/Using-gRPC-in-Production/ ). However, after a period of inactivity the first gRPC request still fails, returning an UNAVAILABLE error.

I'm using Alibaba's ACK to host my app, which is written in Python.

Anyone have any ideas on how to troubleshoot? I'm at a loss.
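For reference, keepalive in Python gRPC is configured through channel arguments on both sides. The following is a minimal sketch, not the asker's actual configuration: the helper names (`client_keepalive_options`, `server_keepalive_options`, `open_channel`) and the numeric values are assumptions chosen for illustration, while the `grpc.*` keys are standard gRPC core channel arguments. `grpc` is imported lazily so the option builders can be inspected without the package installed.

```python
def client_keepalive_options(keepalive_time_ms=10_000, keepalive_timeout_ms=5_000):
    """Build client channel arguments that enable HTTP/2 keepalive pings.

    The default values are illustrative assumptions, not recommendations.
    """
    return [
        # Send a keepalive ping after this much time without activity.
        ("grpc.keepalive_time_ms", keepalive_time_ms),
        # Treat the connection as dead if the ping is not acknowledged in time.
        ("grpc.keepalive_timeout_ms", keepalive_timeout_ms),
        # Keep pinging even when there are no RPCs in flight.
        ("grpc.keepalive_permit_without_calls", 1),
        # 0 removes the limit on pings sent without data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]


def server_keepalive_options(min_ping_interval_ms=5_000):
    """Build server arguments that tolerate frequent client pings."""
    return [
        ("grpc.keepalive_permit_without_calls", 1),
        # Shortest ping interval the server accepts from a client that is
        # sending no data; more frequent pings count as "strikes".
        ("grpc.http2.min_ping_interval_without_data_ms", min_ping_interval_ms),
        # 0 means the server never closes the connection over ping strikes.
        ("grpc.http2.max_ping_strikes", 0),
    ]


def open_channel(target, options=None):
    """Open an insecure channel to `target` (hypothetical address) with keepalive."""
    import grpc  # imported lazily; requires the grpcio package

    return grpc.insecure_channel(target, options=options or client_keepalive_options())
```

One thing worth checking with settings like these: if the client pings more often than the server's `grpc.http2.min_ping_interval_without_data_ms` allows, the server may close the connection for sending too many pings, so the two sides need to agree.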

See: [UNAVAILABLE](https://github.com/grpc/grpc-go#the-rpc-failed-with-error-code--unavailable-desc--transport-is-closing) for the Go SDK but generally applicable (it's challenging) and consider reducing the value of `grpc.keepalive_time_ms` to further increase the frequency. It's possible that Alibaba is too aggressively harvesting the underlying TCP connections and borking the overlying HTTP/2. You may want to increase logging to see whether you can capture TCP connections being closed. Because everything's managed, you may be limited in what you can change. — DazWilkin, Nov 07 '21 at 16:14
Thanks for the response, I'm still a little confused though - what does 'harvesting' mean in this case? — William Gazeley, Nov 08 '21 at 00:11
Apologies. HTTP/2 uses TCP and is connection-oriented. The keep-alive is a way to check whether the gRPC peer is alive and, by keeping traffic on the TCP connection(s), has a possible side effect of keeping the TCP connection open. But the TCP layer is unaware of HTTP/2 and gRPC, and the infrastructure managing TCP connections uses its own logic to determine when a TCP connection is no longer needed and could be terminating these too aggressively. Additionally, you will have different connections between clients, TCP proxies and servers, and only one TCP connection need fail — DazWilkin, Nov 08 '21 at 01:09
I see. It's a little hacky, but do you think adding retries would be an acceptable solution? The service will always be slow for the first user in a while, though. I'll try logging to catch the TCP close, but I still wouldn't know how to prevent it — William Gazeley, Nov 08 '21 at 05:34
Are the pods OK after a period of inactivity? Can the pods ping each other? — Mikolaj S., Nov 08 '21 at 15:21
Yes, but the first request returns an Unavailable error. The second request will then work as expected. I'm not sure how to ping the pods — William Gazeley, Nov 08 '21 at 18:10
I mean: get the IP addresses of the pods (`kubectl get pods -o wide` command), then exec into each pod using the [`kubectl exec` command](https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/) and try to ping the other pod. Please let me know the results. — Mikolaj S., Nov 09 '21 at 18:04
Ping works, but gRPC for the first request still fails after a few hours of inactivity: --- 10.95.0.25 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3037ms rtt min/avg/max/mdev = 0.070/0.100/0.168/0.039 ms root@api-6c9fcb7c89-zvsmp:/api# — William Gazeley, Nov 10 '21 at 03:53
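The retry idea raised in the comments can be expressed without hand-written retry loops, by passing a gRPC service config with a `retryPolicy` as a channel argument. The following is a sketch under assumptions: the helper names and the service name `example.Api` are hypothetical, the backoff values are arbitrary, and it is untested against ACK. It only retries calls that fail with UNAVAILABLE, which matches the symptom described here.

```python
import json


def retry_service_config(service, max_attempts=3):
    """Return a JSON service config that retries UNAVAILABLE for one service.

    `service` is the fully qualified proto service name (hypothetical here).
    """
    config = {
        "methodConfig": [
            {
                # Apply the policy to every method of this service.
                "name": [{"service": service}],
                "retryPolicy": {
                    "maxAttempts": max_attempts,
                    "initialBackoff": "0.1s",
                    "maxBackoff": "1s",
                    "backoffMultiplier": 2,
                    "retryableStatusCodes": ["UNAVAILABLE"],
                },
            }
        ]
    }
    return json.dumps(config)


def retry_options(service, max_attempts=3):
    """Build channel arguments that enable retries with the config above."""
    return [
        ("grpc.enable_retries", 1),
        ("grpc.service_config", retry_service_config(service, max_attempts)),
    ]


# Usage sketch (requires grpcio; the target address is a placeholder):
#   channel = grpc.insecure_channel("api:50051", options=retry_options("example.Api"))
```

This hides the failed first attempt from the caller but does not stop the idle connection from being dropped, so the first request after a quiet period would still pay the reconnect cost.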

0 Answers