8

I have two microservices that communicate with each other through gRPC. A is the RPC client and B is the RPC server; both are written in Node.js using the grpc npm module.

Everything works fine until, at some point, A unexpectedly stops being able to send requests to B. The calls fail with a timeout (5s) and throw this error:

Error: Deadline Exceeded

Both microservices are Docker containers that run on AWS ECS and communicate through an AWS ELB (not an ALB, because it does not support HTTP/2 and because of some other problems).

I tried running telnet from A to the ELB of B, both from the EC2 instance and from inside the running ECS task (the Docker container itself), and it connected fine. Still, the Node.js application in A cannot reach the Node.js application in B over the gRPC connection.

The only way to fix it is to stop and start the ECS tasks, after which A succeeds in connecting to B again (until the next time the same scenario unexpectedly recurs), but of course that's not a solution.

Has anyone faced this kind of issue?

Shlomi
  • 1
    What was the conclusion for this issue? Did you resolve it, and if so, how? I am currently experiencing the same issue: Deadline Exceeded for one unary gRPC call, and then every single call after that one started failing. – cool Jul 13 '18 at 10:46
  • Facing the same issue, any solutions? – Drake .C Nov 14 '20 at 00:08

2 Answers

0

Do you use the unary or the streaming API? Do you set a deadline? A gRPC deadline is per-stream, so with streaming, when you set a deadline of X milliseconds, you'll get DEADLINE_EXCEEDED X milliseconds after you opened the stream (not after you sent or received any message!). And you'll keep getting it forever for that stream; the only way to get rid of it is to reopen the stream.
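To make the per-call deadline concrete, here is a minimal sketch of setting a fresh deadline on every unary call. The names `deadlineIn` and `unaryWithDeadline` are hypothetical helpers, not part of the grpc module, and the sketch assumes the generated stub method accepts `(request, options, callback)` with a `deadline` call option given as a `Date`.

```javascript
// Hypothetical helper: absolute deadline `ms` milliseconds after `now`.
// The deadline is an absolute point in time, so it must be computed
// again for every call (and for every newly opened stream).
function deadlineIn(ms, now = Date.now()) {
  return new Date(now + ms);
}

// Hypothetical wrapper: invokes a unary stub method with a fresh deadline.
// Assumes `client[methodName]` has the shape (request, options, callback).
function unaryWithDeadline(client, methodName, request, timeoutMs) {
  return new Promise((resolve, reject) => {
    const options = { deadline: deadlineIn(timeoutMs) };
    client[methodName](request, options, (err, response) => {
      if (err) {
        reject(err); // e.g. err.code === 4 for DEADLINE_EXCEEDED
      } else {
        resolve(response);
      }
    });
  });
}
```

Something like `unaryWithDeadline(client, 'getModel', { id: 1 }, 5000)` would then give each call its own 5s budget, where `getModel` is a made-up method name. For a stream, the same deadline covers the whole lifetime of the stream, so a long-lived stream needs either a longer deadline or a reopen.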

Alex Borysov
  • We're using both unary (for single models) and streaming (for collections), but the error I mentioned is received for the unary calls. – Shlomi Jan 03 '18 at 06:07
0

I have found that I need not only to create a new stub but also to re-create the connection after some errors in order to get it to reconnect. (I'm also running in ECS.)
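Roughly, the idea can be sketched like this. It is only an illustration under assumptions: `ReconnectingClient` and `createClient` are hypothetical names, the factory is whatever you already use to build your stub, and I am assuming the stub exposes `close()` and methods of the shape `(request, options, callback)`. The numeric status codes 4 and 14 are the standard gRPC codes for DEADLINE_EXCEEDED and UNAVAILABLE.

```javascript
// Standard gRPC status codes.
const DEADLINE_EXCEEDED = 4;
const UNAVAILABLE = 14;

// Hypothetical wrapper that throws away the stub (and its connection)
// after a connection-level error and builds a new one.
// `createClient` is a caller-supplied factory returning a fresh stub.
class ReconnectingClient {
  constructor(createClient) {
    this.createClient = createClient;
    this.client = createClient();
  }

  // Close the old connection if the stub supports it, then create a new stub.
  reset() {
    if (this.client && typeof this.client.close === 'function') {
      this.client.close();
    }
    this.client = this.createClient();
  }

  // Invoke a unary method; on DEADLINE_EXCEEDED or UNAVAILABLE, reset so
  // that the next call starts from a fresh connection.
  call(methodName, request, options, callback) {
    this.client[methodName](request, options, (err, response) => {
      if (err && (err.code === DEADLINE_EXCEEDED || err.code === UNAVAILABLE)) {
        this.reset();
      }
      callback(err, response);
    });
  }
}
```

Note that a slow server also produces DEADLINE_EXCEEDED, so resetting on every such error may be too aggressive; counting consecutive failures before calling `reset()` is probably a safer variation.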

jdwyah