
We have a Java-based Dataflow pipeline that reads from Bigtable and, after some processing, writes data back to Bigtable. We use CloudBigtableIO for both reading and writing.
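Roughly, the pipeline looks like the sketch below (project, instance, table, and column-family names are placeholders, and the real processing step is elided):

```java
import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableScanConfiguration;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtablePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read rows from the source table.
    CloudBigtableScanConfiguration readConfig =
        new CloudBigtableScanConfiguration.Builder()
            .withProjectId("my-project")      // placeholder
            .withInstanceId("my-instance")    // placeholder
            .withTableId("source-table")      // placeholder
            .build();

    // Write mutations to the destination table.
    CloudBigtableTableConfiguration writeConfig =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId("my-project")
            .withInstanceId("my-instance")
            .withTableId("destination-table")
            .build();

    p.apply("ReadFromBigtable", Read.from(CloudBigtableIO.read(readConfig)))
        .apply("Process", ParDo.of(new DoFn<Result, Mutation>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Result row = c.element();
            // ... some processing; here we just copy the row key into a Put.
            Put put = new Put(row.getRow());
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            c.output(put);
          }
        }))
        .apply("WriteToBigtable", CloudBigtableIO.writeToTable(writeConfig));

    p.run();
  }
}
```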

I am trying to wrap my head around failure handling in CloudBigtableIO. I haven't found any references/documentation on how errors are handled inside and outside CloudBigtableIO.

  1. CloudBigtableIO has a bunch of options in BigtableOptionsFactory which specify timeouts, gRPC codes to retry on, and retry limits.

    google.bigtable.grpc.retry.max.scan.timeout.retries - is this the retry limit for scan operations only, or does it cover mutation operations as well? If it is just for scans, how many retries are done for mutation operations, and is that configurable? google.bigtable.grpc.retry.codes - do these codes enable retries for both scan and mutate operations?

  2. Customizing these options would only enable retries; could there be cases where CloudBigtableIO reads only part of the requested data but does not fail the pipeline?

  3. When mutating a few million records, I think it is possible that we get errors beyond the retry limits. What happens to such mutations? Do they simply fail? How do we handle them in the pipeline? BigQueryIO has a function that collects failures and provides a way to retrieve them through a side output; why doesn't CloudBigtableIO have such a function?

    We occasionally get DEADLINE_EXCEEDED errors while writing mutations, but the logs are not clear about whether the mutations were retried and eventually succeeded, or whether the retries were exhausted. I do see RetriesExhaustedWithDetailsException, but that is of no use if we are not able to handle the failures.

  4. Are these failures thrown back to the preceding step in the Dataflow pipeline if the preceding step and the CloudBigtableIO write are fused? With bulk mutations enabled, it is not really clear how failures are thrown back to preceding steps.

2 Answers


For question 1, I believe google.bigtable.mutate.rpc.timeout.ms would correspond to mutation operations, though it is noted in the Javadoc that the feature is experimental. google.bigtable.grpc.retry.codes allows you to add additional codes to retry on beyond the defaults (the defaults include DEADLINE_EXCEEDED, UNAVAILABLE, ABORTED, and UNAUTHENTICATED).

You can see an example of the configuration getting set for mutation timeouts here: https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-client-core-parent/bigtable-hbase/src/test/java/com/google/cloud/bigtable/hbase/TestBigtableOptionsFactory.java#L169
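As a rough illustration, these client properties can be passed through to CloudBigtableIO via the configuration builder's withConfiguration(key, value) method. The specific values below (timeout, scan retry count, and the extra retry code) are assumptions for the example, not recommendations:

```java
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

public class RetryConfigExample {
  // Sketch: passing BigtableOptionsFactory properties through the write configuration.
  static CloudBigtableTableConfiguration buildWriteConfig() {
    return new CloudBigtableTableConfiguration.Builder()
        .withProjectId("my-project")        // placeholder
        .withInstanceId("my-instance")      // placeholder
        .withTableId("destination-table")   // placeholder
        // Timeout for mutation RPCs (noted as experimental in the Javadoc).
        .withConfiguration("google.bigtable.mutate.rpc.timeout.ms", "60000")
        // Number of retries after a scan timeout.
        .withConfiguration("google.bigtable.grpc.retry.max.scan.timeout.retries", "10")
        // Additional gRPC status codes to retry on, on top of the defaults.
        .withConfiguration("google.bigtable.grpc.retry.codes", "RESOURCE_EXHAUSTED")
        .build();
  }
}
```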

  • google.bigtable.grpc.retry.max.scan.timeout.retries:

    It is only for setting the number of times to retry after a SCAN timeout.

  • Regarding retries on mutation operations

    This is how Bigtable handles operation failures.

  • Regarding your question about handling errors in the pipeline

    I see that you are already aware of RetriesExhaustedWithDetailsException. Please keep in mind that in order to retrieve the detailed exceptions for each failed request you have to call RetriesExhaustedWithDetailsException#getCauses() (see the sketch after this list).

  • As for the failures, Google documentation states:

    " Append and Increment operations are not suitable for retriable batch programming models, including Hadoop and Cloud Dataflow, and are therefore not supported inputs to CloudBigtableIO.writeToTable. Dataflow bundles, or a group of inputs, can fail even though some of the inputs have been processed. In those cases, the entire bundle will be retried, and previously completed Append and Increment operations would be performed a second time, resulting in incorrect data."

Some documentation that you may consider helpful:

Hope you find the above helpful.

