I am having a problem with a gRPC C++ client making calls against Google Cloud Bigtable. These calls occasionally hang, and they return only if a call deadline is set. There is an issue filed with the gRPC team, https://github.com/grpc/grpc/issues/6278, with a stack trace and a piece of the gRPC tracing log.

The call that hangs most often is the ReadRows streaming read call. I have seen the MutateRow call hang a few times as well, but that is rather rare.

gRPC tracing shows that some response does come back from the server; however, that response seems to be insufficient for the gRPC client to proceed.

I have spent a fair amount of time debugging the code. I have found no obvious problems on the client side so far and have seen no memory corruption. This is a single-threaded application making one call at a time, so client-side concurrency is not a suspect. The client runs on a Google Compute Engine box, so the network is unlikely to be the issue either. The gRPC client is kept up to date with the main line of the GitHub repository.

Any suggestions would be appreciated, and debugging ideas would be great as well. Using valgrind and gdb, and reducing the application to a subset with reproducible results, have not helped so far; I have not been able to find out what the problem is. The problem is random and shows up only occasionally.

Additional note on May 17, 2016: There was a suggestion that retries may help to deal with the issue. Unfortunately, retries do not work very well for us because we would have to carry them over into the application logic. We can easily retry updates, that is, MutateRow calls, and we do; these are not streaming calls and are easy to retry. However, once the application has begun iterating over the DB query results, a failure means that the application needs to re-issue the query and start iterating over the results again, which is problematic.

We could change our applications to read the whole result set at once and then iterate over it in memory at the application level, so that retries could be handled. But that is problematic for all kinds of reasons, such as memory footprint and application latency. We want to process DB query results as soon as they arrive, not when all of them are in memory. The timeout is also added to the call latency whenever the call hangs. So retries of the query result iterations are costly to such a degree that they are not practical.

ay60
  • Long running reads are definitely problematic. The Java Bigtable client keeps track of what it saw, and then creates a new request starting after the last seen rows. This is a nuanced problem. Feel free to reach out to me at sduskis at goog le dot com. – Solomon Duskis May 26 '16 at 19:53
  • These are not long running queries. The failures that I see are usually at the very first read calls of the stream. And these are not long running sessions. I see these failures within a minute or two of a test app running. – ay60 Jun 08 '16 at 19:57

1 Answer

We've experienced hanging issues with gRPC in various languages. The gRPC team is investigating.

Solomon Duskis
  • Any progress with the investigation? Any suggestions as to what we can do to mitigate the issue? Retries were suggested but retries do not work very well for us with the streaming calls. – ay60 May 24 '16 at 16:45
  • We had some internal discussions. Is the hanging related to connections closing and the dreaded max_age issue? If so, the guidance is to create a retry framework that retries the puts/gets. Here's one of the Java classes we use as part of our retries: https://github.com/GoogleCloudPlatform/cloud-bigtable-client/blob/master/bigtable-client-core/src/main/java/com/google/cloud/bigtable/grpc/async/AbstractRetryingRpcListener.java – Solomon Duskis May 25 '16 at 22:37
  • No, this is not the max_age issue. I see hanging happening after a few minutes of operation, and this is random. – ay60 May 26 '16 at 20:17
  • I'm not sure if it helps, but here's how we do retries in Java for the streaming read problem: https://github.com/GoogleCloudPlatform/cloud-bigtable-client/blob/master/bigtable-client-core/src/main/java/com/google/cloud/bigtable/grpc/scanner/ResumingStreamingResultScanner.java – Solomon Duskis May 27 '16 at 22:35
  • Can you please reach out to me privately on sduskis at goog le dot com so that I can give you more information? – Solomon Duskis May 27 '16 at 22:36
  • See my notes from May 17 above, retries are not a viable solution for us, not on the streaming calls. – ay60 Jun 02 '16 at 16:24
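
For readers weighing the approach described in the comments (remember the last row seen, then issue a new read that starts after it), here is a minimal C++ sketch of the idea. It is not the gRPC or Bigtable API and not the Java client's implementation: `Row`, `StreamError`, `FakeTable` and `ResumingScan` are hypothetical names, and `FakeTable` is an in-memory stand-in whose scan fails once to imitate a stream that ends with a deadline error. In a real client the resumed scan would presumably be a new ReadRows request whose row range begins after the last delivered key.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type standing in for one streamed result.
struct Row {
    std::string key;
    std::string value;
};

// Hypothetical error type standing in for a failed or timed-out stream.
struct StreamError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical in-memory "server": rows sorted by key. A scan throws once,
// after `fail_after` rows have been delivered, to imitate a broken stream.
class FakeTable {
public:
    FakeTable(std::map<std::string, std::string> rows, int fail_after)
        : rows_(std::move(rows)), fail_after_(fail_after) {}

    // Streams rows whose key is strictly greater than *start_after
    // (or all rows when start_after is null).
    void Scan(const std::string* start_after,
              const std::function<void(const Row&)>& on_row) {
        auto it = start_after ? rows_.upper_bound(*start_after) : rows_.begin();
        for (; it != rows_.end(); ++it) {
            if (fail_after_ == 0) {
                fail_after_ = -1;  // fail only once
                throw StreamError("deadline exceeded");
            }
            if (fail_after_ > 0) --fail_after_;
            on_row(Row{it->first, it->second});
        }
    }

private:
    std::map<std::string, std::string> rows_;
    int fail_after_;
};

// Delivers each row to on_row once, in key order. When a scan attempt fails,
// a new scan is started after the last delivered key instead of from the
// beginning. Returns the number of resumes; rethrows after max_retries.
int ResumingScan(FakeTable& table, int max_retries,
                 const std::function<void(const Row&)>& on_row) {
    assert(max_retries >= 0);
    std::string last_key;
    bool have_last = false;
    int retries = 0;
    for (;;) {
        try {
            table.Scan(have_last ? &last_key : nullptr, [&](const Row& row) {
                on_row(row);          // the application processes the row now
                last_key = row.key;   // remember how far we got
                have_last = true;
            });
            return retries;
        } catch (const StreamError&) {
            if (++retries > max_retries) throw;
        }
    }
}
```

With this shape the application still processes rows as they arrive and never sees a row twice, so the whole result set does not have to be held in memory. It does not remove the cost raised in the question: each hang still adds the full deadline to the latency before the scan resumes.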