
We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (something like the number of items that have finished processing), for which we need the increment operation. From looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The stated reason is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting increments be a bad idea, given that I know for sure the increment was called only once? I want to understand what part I am missing.
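
For concreteness, the operation we want to issue from the pipeline would look roughly like this with the HBase client (a sketch only; the table, row, family, and qualifier names are made up):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterIncrement {
  // Bumps a per-pipeline counter cell by one. All names here are
  // hypothetical; "connection" is an already-open HBase Connection.
  static void incrementFinishedCount(Connection connection) throws Exception {
    try (Table table = connection.getTable(TableName.valueOf("metrics"))) {
      Increment inc = new Increment(Bytes.toBytes("pipeline-counters"));
      inc.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("finished"), 1L);
      table.increment(inc); // server-side read-modify-write
    }
  }
}
```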

Also, is CloudBigTableIO usable in streaming mode, or is it tied to batch mode only? I suppose we could use the BigTable HBase client directly in the pipeline (sketched below), but the connector appears to have nice properties, like connection pooling, that we would like to leverage; hence the question.
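
If we did go directly against the client, we imagine a DoFn along these lines (a minimal sketch, written against the Beam 2.x DoFn lifecycle; the project, instance, and table IDs are placeholders, and we are assuming BigtableConfiguration.connect as the client entry point):

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// One Connection per DoFn instance; workers reuse DoFn instances across
// bundles, which approximates the pooling the connector provides for free.
class DirectBigtableWriteFn extends DoFn<KV<String, String>, Void> {
  private transient Connection connection;

  @Setup
  public void setup() {
    connection = BigtableConfiguration.connect("my-project", "my-instance");
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    try (Table table = connection.getTable(TableName.valueOf("my-table"))) {
      Put put = new Put(Bytes.toBytes(c.element().getKey()));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
          Bytes.toBytes(c.element().getValue()));
      table.put(put); // idempotent: safe for the runner to retry
    }
  }

  @Teardown
  public void teardown() throws Exception {
    if (connection != null) {
      connection.close();
    }
  }
}
```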

noobNeverything

2 Answers


The way that Dataflow (and other systems) offer the appearance of exactly-once execution in the presence of failures and retries is by requiring that side effects (such as mutating BigTable) be idempotent. A "write" is idempotent because a retry simply overwrites the same cell with the same value. Inserts can be made idempotent by including a deterministic "insert ID" that deduplicates the insert.

For an increment, neither trick applies: the operation is not idempotent when retried, so supporting it would not be compatible with exactly-once execution.
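
A sketch of the contrast, against a plain HBase Table (hypothetical row and column names):

```java
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class RetrySemantics {
  static final byte[] CF = Bytes.toBytes("stats");
  static final byte[] COL = Bytes.toBytes("count");

  static void illustrate(Table table) throws Exception {
    // Idempotent: a retry overwrites the same cell with the same value,
    // so one attempt and two attempts leave the table in the same state.
    Put put = new Put(Bytes.toBytes("row-1")).addColumn(CF, COL, Bytes.toBytes(42L));
    table.put(put);
    table.put(put); // simulated retry: no observable difference

    // Not idempotent: each call is a server-side read-modify-write,
    // so a retry after a lost acknowledgement double-counts.
    Increment inc = new Increment(Bytes.toBytes("row-1")).addColumn(CF, COL, 1L);
    table.increment(inc);
    table.increment(inc); // simulated retry: the counter is now off by one
  }
}
```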

Ben Chambers
  • Thanks for the response. I was reading https://cloud.google.com/dataflow/service/dataflow-service-desc#structuring-your-user-code and I had only looked at the _exactly-once_ guarantee, but didn't realise it was tied to the _idempotency_ expected of the DoFn. So what Dataflow actually guarantees is _at-least-once_, and the _idempotency_ of the app itself helps make it _exactly-once_; it would be nice to make that more explicit in the documentation, IMHO. – noobNeverything May 08 '17 at 22:42
  • Can you please explain how this _at-least-once_ semantics applies to a [Stateful ParDo](https://beam.apache.org/blog/2017/02/13/stateful-processing.html)? If a counter is maintained in the state of the `ParDo` and an element is retried in it, would that cause the counter to be mutated twice for the same element (like any other side effect), or is the state mutation handled so as to be _exactly-once_? – noobNeverything May 09 '17 at 05:54
  • It is theoretically impossible to provide exactly-once execution of side effects: if a worker dies while running your DoFn's code on an element, there's nothing Beam can do except run the code again. However, the Beam model's semantics are exactly-once in the sense that the contents of all PCollections, the values of metrics, state mutations, etc. happen as if the code ran exactly once, usually achieved via transaction-like mechanisms in the runner. – jkff May 09 '17 at 06:45
  • E.g., if a worker dies while processing an element, and the element is then retried and processed successfully, the PCollection outputs, metric changes, state mutations, etc. from the successful attempt are accounted for, and the results from the failed attempt are discarded, giving the impression that the element was processed exactly once. – jkff May 09 '17 at 06:47
  • I was slightly confused by the mixing of side effects and the actual pipeline data, which I am clear about now. So any internal state is transactional, but anything external has no such guarantees, and all such external mutations (GCS/BigTable) need to be idempotent. But, for example, if `CloudBigTableIO` does work in streaming mode, why can't it still support an increment operation, given that increments could be performed transactionally and any failure could be reversed by applying the inverse operation? I understand it might be out of scope, but I am trying to understand the feasibility of such an API. – noobNeverything May 09 '17 at 17:03
  • For a Stateful ParDo, Dataflow only commits the changes to the state when the successful processing of the element is committed, and these happen atomically. This ensures that either the increment and the element are both recorded, or neither is. That provides effectively-once (i.e., observably exactly-once) execution [see the sketch after this comment thread]. – Ben Chambers May 09 '17 at 17:35
  • The difference is that Dataflow manages the state, so when it has to abandon processing after a failure it can also abandon the state changes that were made. BigTable is an external service, and there is no generic way to both write to BigTable and commit the processing of elements within a single transaction. – Ben Chambers May 09 '17 at 17:52
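
To make the comment thread above concrete, here is a minimal sketch of a stateful counter whose state commits atomically with each element, assuming the Beam 2.x state API (@StateId/ValueState); all names are illustrative:

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// The counter lives in runner-managed state: its update commits atomically
// with the element's processing, so a retried element re-reads the last
// committed value instead of double-counting.
class CountPerKeyFn extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec =
      StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      ProcessContext c, @StateId("count") ValueState<Long> count) {
    Long current = count.read();
    long next = (current == null ? 0L : current) + 1;
    count.write(next); // only durable if this element's processing commits
    c.output(KV.of(c.element().getKey(), next));
  }
}
```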

CloudBigTableIO is usable in streaming mode. To support that via the Dataflow SDK, we had to implement it as a DoFn rather than a Sink.
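
For reference, attaching the connector's write transform to a (possibly unbounded) collection of mutations looks roughly like this. This is a sketch against the Dataflow-SDK flavor of the connector (com.google.cloud.bigtable.dataflow); the project/instance/table IDs are placeholders, and the Builder and initializeForWrite calls are assumed to match the connector documentation of the time:

```java
import com.google.cloud.bigtable.dataflow.CloudBigtableIO;
import com.google.cloud.bigtable.dataflow.CloudBigtableTableConfiguration;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.apache.hadoop.hbase.client.Mutation;

class BigtableStreamingWrite {
  // Wires the connector's write transform onto an existing PCollection of
  // HBase Mutations (e.g. Puts produced by an upstream DoFn).
  static void writeToBigtable(Pipeline p, PCollection<Mutation> mutations) {
    CloudBigtableTableConfiguration config =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId("my-project")   // placeholder
            .withInstanceId("my-instance") // placeholder
            .withTableId("my-table")       // placeholder
            .build();
    CloudBigtableIO.initializeForWrite(p); // registers coders for Mutation
    mutations.apply(CloudBigtableIO.writeToTable(config));
  }
}
```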

Solomon Duskis