
I'm trying to get a fairly simple test scenario to work: I'd like to create a long-lived bidirectional streaming RPC that may sit idle for long periods of time (an Electron app with a local server).

A Node gRPC client starts a C# gRPC server locally and initiates a bidirectional stream. The streaming service receives each message, waits 50 ms, and sends it back.

The Node client test code is set up to send 5 messages, wait 30 seconds, and then send 5 more messages. The first 5 messages successfully roundtrip. The second 5 messages eventually roundtrip, but not until 5 minutes later. The server side code is not hit during this time.
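The client-side test flow looks roughly like this (a sketch only: `sendBatch`, the message shape, and the `stream` object are hypothetical stand-ins for the real generated client code):

```javascript
// Send a batch of messages over an open bidirectional stream.
// The { id, payload } message shape is illustrative, not the real proto.
function sendBatch(stream, count) {
  for (let i = 0; i < count; i++) {
    stream.write({ id: i, payload: `message ${i}` });
  }
}

// Reproduces the scenario: 5 messages, idle for 30 s, 5 more messages.
async function runScenario(stream, idleMs = 30000) {
  sendBatch(stream, 5); // first 5 round-trip fine
  await new Promise((resolve) => setTimeout(resolve, idleMs)); // idle period
  sendBatch(stream, 5); // these don't round-trip until ~5 minutes later
}
```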

I'm sure I'm being a baboon here, but I don't understand why the connection seems to be dying so fast, and I'm not sure which options, if any, could help. It seems like keepalive is intended for detecting whether the TCP connection is still alive, rather than actually keeping it alive. idleTimeout doesn't seem relevant either, because the channel is going to TRANSIENT_FAILURE status according to the enum documentation here.
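For reference, these are the client-side channel options I've been experimenting with. In @grpc/grpc-js the GRPC_ARG_* constants are spelled as string keys and passed as the third argument to the client constructor; the timing values below are illustrative, and the service name/port are hypothetical:

```javascript
// Channel options corresponding to GRPC_ARG_KEEPALIVE_TIME_MS,
// GRPC_ARG_KEEPALIVE_TIMEOUT_MS, and GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS.
const keepaliveOptions = {
  // Send an HTTP/2 PING every 10 s while the channel is up.
  'grpc.keepalive_time_ms': 10000,
  // Treat the connection as dead if a PING ack takes longer than 5 s.
  'grpc.keepalive_timeout_ms': 5000,
  // Keep pinging even when no calls are in flight (1 = allowed).
  'grpc.keepalive_permit_without_calls': 1,
};

// Usage (hypothetical generated client):
//   const client = new EchoServiceClient(
//     'localhost:50051',
//     grpc.credentials.createInsecure(),
//     keepaliveOptions
//   );
```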

This discussion from 2016 is close to what I'm trying to do, but the solution there was a roll-your-own heartbeat. This grpc-dotnet issue seems to rely on a heartbeat-type solution specific to ASP.NET, which we are not currently using.
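If it comes to that, the roll-your-own heartbeat would look something like this sketch (assumptions: `stream` is an open bidirectional call, and the proto message has a `type` field the server can ignore or echo; both are hypothetical):

```javascript
// Periodically write a no-op message on the stream so traffic keeps flowing.
// Returns a function that stops the heartbeat.
function startHeartbeat(stream, intervalMs = 15000) {
  const timer = setInterval(() => {
    stream.write({ type: 'heartbeat' }); // server echoes or discards it
  }, intervalMs);
  // Don't let the heartbeat alone keep the Node process alive.
  if (timer.unref) timer.unref();
  return () => clearInterval(timer);
}
```

The downside, and why I'd rather avoid it, is that the server and the proto both have to know about a message type that exists purely to work around transport behavior.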

gRPC server logs:

After the first 5 messages are sent:

transport 000001A7B5A63090 set connectivity_state=4

Start BDP ping err..."Endpoint read failed" (paraphrasing)

5 minutes later right before the second set of 5 messages comes through:

W:000001A7B5AC8A10 SERVER [ipv6:[::1]:57416] state IDLE -> WRITING [RETRY_SEND_PING]

Node library is @grpc/grpc-js

tl;dr: How can I keep the connection healthy and working across idle periods?

Doug Denniston
  • What gRPC library are you using, exactly, in your Node client? Are the logs you quoted from the client or the server? What keepalive settings are you setting in the client? – murgatroid99 Mar 01 '21 at 21:45
  • @murgatroid99 node library: @grpc/grpc-js & server logs. thanks for responding. this is with no keepalive settings – Doug Denniston Mar 02 '21 at 12:42
  • You said "It seems like keepalive is intended for tracking whether the TCP connection is still alive, but doesn't actually help keep it alive." and the link there is about setting up keepalive on the client. I assumed you meant that you tried setting up keepalive on the client and it didn't work for you. – murgatroid99 Mar 02 '21 at 17:33
  • @murgatroid99 ah ok. Sorry, what I was trying to express is that I tried with and without keepalives on the client and it didn't seem to make a difference, and the logs above are without keepalive. What I tried on the client: GRPC_ARG_KEEPALIVE_TIME_MS, GRPC_ARG_KEEPALIVE_TIMEOUT_MS, GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS. Am I mistaken about the purpose of keepalives? Does it actually do something to keep the connection healthy or is it just a health check? – Doug Denniston Mar 03 '21 at 13:29
  • Yes, the purpose of the keepalive functionality is to keep the connection alive. I am trying to determine whether the problem here is that keepalives were not correctly configured, or that there is a bug in the keepalive code, and I think your last comment answers that. – murgatroid99 Mar 03 '21 at 17:33
  • OK. I'll try to find the time to set up a minimal repro if that would be helpful – Doug Denniston Mar 03 '21 at 17:52
  • I think it might be helpful. Reading the question again, I noticed that you said you only stop sending for 30 seconds. That's a weirdly short time for an idle connection to be killed, so I'm not sure that keepalive is actually the culprit here. – murgatroid99 Mar 03 '21 at 18:08

0 Answers