We have a java based Data flow pipeline which reads from Bigtable, after some processing write data back to Bigtable. We use CloudBigtableIO for these purposes.
I am trying wrap my head around failure handling in CloudBigtableIO. I haven;t found any references/documentation on how the errors are handled inside and outside the CloudBigtableIO.
CloudBigtableIO has bunch of Options in BigtableOptionsFactory which specify timeouts, grpc codes to retry on, retry limits.
google.bigtable.grpc.retry.max.scan.timeout.retries - is this the retry limit for scan operations or does it include Mutation operations as well? if this is just for scan, how many retries are done for Mutation operations? is it configurable? google.bigtable.grpc.retry.codes - Do these codes enable retries for both scan, Mutate operations?
Customizing options would only enable retries, would there be cases where CloudBigtableIO reads partial data than what is requested but not fails the pipeline?
When mutating few millions of records, I think it is possible we get errors beyond the retry limits, what happens to such mutations? do they fail simply? how do we handle them in pipeline? BigQueryIO has function that collects failures and provides a way to retrieve them through side output, why do CloudBigtableIO doesn't have one such functions?
We occasionally get DEADLINE_EXCEEDED errors while writing mutations but the logs are not clear whether the mutations were retried and successful or Retries were exhausted, I do see RetriesExhaustedWithDetailsException but that is of no use, if we are not able to handle failures
Are these failures thrown back to the preceding step in data flow pipeline if preceding step and CloudBigtableIO write are fused? with bulk mutations enabled it is not really clear on how the failures are thrown back to preceding steps.