I am running a streaming Beam pipeline where I stream files/records from GCS using AvroIO and then create minutely/hourly buckets to aggregate events and add them to BigQuery. If the pipeline fails, how can I recover correctly and process only the unprocessed events? I do not want to double count events.

One approach I was considering is writing to Spanner or Bigtable, but it may be the case that the write to BigQuery succeeds while the write to the DB fails, or vice versa. How can I maintain state in a reliable, consistent way in a streaming pipeline so that I process only unprocessed events? I want to make sure the final aggregated data in BigQuery is the exact count for the different events, neither under- nor over-counted.

How does a Spark streaming pipeline solve this (I know it has a checkpointing directory for managing the state of queries and DataFrames)? Are there any recommended techniques for solving this kind of problem accurately in streaming pipelines?
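For reference, roughly what the pipeline looks like, as a simplified sketch (Beam Java SDK; the Avro schema, GCS path, and BigQuery table below are placeholders, and event-time assignment, triggers, and late-data handling are omitted):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class EventCountPipeline {
  // Placeholder Avro schema for the event records.
  private static final String EVENT_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"event_type\",\"type\":\"string\"},"
          + "{\"name\":\"event_time\",\"type\":\"long\"}]}";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadAvroFromGcs",
            AvroIO.readGenericRecords(EVENT_SCHEMA_JSON)
                .from("gs://my-bucket/events/*.avro")            // placeholder path
                .watchForNewFiles(Duration.standardSeconds(30),  // poll GCS for new files
                                  Watch.Growth.never()))         // keep watching indefinitely
     .apply("MinutelyWindows",
            Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1))))
     .apply("KeyByEventType",
            MapElements.into(TypeDescriptors.strings())
                .via((GenericRecord r) -> String.valueOf(r.get("event_type"))))
     .apply("CountPerEventType", Count.perElement())
     .apply("ToTableRow", ParDo.of(new DoFn<KV<String, Long>, TableRow>() {
        @ProcessElement
        public void process(ProcessContext c, BoundedWindow w) {
          IntervalWindow window = (IntervalWindow) w;  // fixed windows are IntervalWindows
          c.output(new TableRow()
              .set("event_type", c.element().getKey())
              .set("window_start", window.start().toString())
              .set("count", c.element().getValue()));
        }
      }))
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:analytics.event_counts")         // placeholder table
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}
```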

- This is a challenging problem to solve with BigQuery: idempotent streaming operations. Unfortunately, the only foolproof way I've achieved this is by resorting to batch processing. Batch processing allows for completely overwriting tables in BigQuery with some bounded data set. – Andrew Nguonly Jan 19 '18 at 17:00
- Not just BigQuery; imagine if you had to write those aggregated window event counts to Bigtable or Spanner or GCS. Is that any easier? The main point is how to determine reliably which events have been processed and how to maintain that state. Batch processing is not real-time enough for monitoring or analytics. – user179156 Jan 19 '18 at 18:24
- Have you considered using Pub/Sub? Also, if you write counts to Bigtable or Spanner keyed by event-time buckets, reprocessing already-processed events will overwrite the last results, which should be fine (?) for your use case (see the sketch after these comments). [This](https://cloud.google.com/blog/big-data/2017/07/after-lambda-exactly-once-processing-in-cloud-dataflow-part-3-sources-and-sinks) blog post has some useful info about exactly-once processing. – Jiayuan Ma Jan 21 '18 at 02:02
- I have read the post above, but it only talks about exactly-once processing and state management within a single job. My case is more about failure of the streaming pipeline and how to maintain state across multiple streaming pipeline jobs. The blog mostly talks about the scope of a single job. I do think some techniques may be applicable, but nowhere does it describe a checkpointing mechanism that can help me restart a job while maintaining state from the previous job. – user179156 Jan 21 '18 at 03:11
- Yes, writing to Bigtable or Spanner by time buckets is an option, but ultimately my goal is to have some engine that can run SQL queries for analytics while consuming real-time data (ideally I would want to dump to BQ). So even if I write to Bigtable, I need some way of extracting the recent data or providing a layer of SQL on top of the Bigtable schema. – user179156 Jan 21 '18 at 03:14
- Pub/Sub is not so reliable; we have run into various issues: clients not acking, too expensive for our data volume, and it still does not quite provide a solution. Pub/Sub will redeliver only messages that were not acked, but I want something more granular, namely messages that were not processed. For example, if the streaming pipeline consumes 1k messages, acks them, and then fails while processing say 100 of them, Pub/Sub doesn't entirely solve the problem. – user179156 Jan 21 '18 at 03:17
- Maybe you can ack the messages only after processing them successfully? – Jiayuan Ma Jan 21 '18 at 21:15
- Acking is handled by PubsubIO itself. And again, this doesn't solve the problem: what if you process and then fail to ack? – user179156 Jan 22 '18 at 02:17
- Short answer is that there is no reliable way to ensure exactly-once processing between two completely disconnected streaming runs. Streaming engines like Dataflow and Flink store the required state internally. With Flink you could restart from the latest checkpoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors; you need to cancel a job explicitly). Dataflow does provide an exactly-once processing guarantee with update. – Raghu Angadi Jan 23 '18 at 01:11
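To illustrate the "counts keyed by event-time buckets" suggestion from the comments: if each aggregate is written under a deterministic key such as (event_type, window_start) with a plain upsert, re-running a window replaces the previous value rather than adding to it, so replays do not double count. A minimal sketch using the Cloud Spanner client (table and column names are hypothetical):

```java
import com.google.cloud.Timestamp;
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import java.util.Collections;

public class IdempotentCountWriter {
  private final DatabaseClient db;

  public IdempotentCountWriter(String project, String instance, String database) {
    Spanner spanner = SpannerOptions.newBuilder().build().getService();
    this.db = spanner.getDatabaseClient(DatabaseId.of(project, instance, database));
  }

  /**
   * Upserts the count for one (event type, window start) bucket.
   * Because the row key is deterministic, replaying the same window
   * overwrites the previous value instead of incrementing it.
   */
  public void writeBucket(String eventType, Timestamp windowStart, long count) {
    Mutation m = Mutation.newInsertOrUpdateBuilder("event_counts")  // hypothetical table
        .set("event_type").to(eventType)
        .set("window_start").to(windowStart)
        .set("count").to(count)
        .build();
    db.write(Collections.singletonList(m));
  }
}
```

The same idea applies to Bigtable (deterministic row key plus a fixed cell timestamp) or to BigQuery via a periodic MERGE from a staging table: the write must be a replace, not an increment. It still does not give an atomic commit across two sinks, which is the gap the answer below describes.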
1 Answer
Based on clarification from the comments, this question boils down to 'can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs start from scratch?'. Short answer is no. Even if the user is willing to store some state in external storage, it needs to be committed atomically/consistently with the streaming engine's internal state. Streaming engines like Dataflow and Flink store the required state internally, which is needed to 'resume' a job. With Flink you could resume from the latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors; you need to cancel a job explicitly). Dataflow does provide an exactly-once processing guarantee with update.
Somewhat relaxed guarantees would be feasible with careful use of external storage. The details really depend on the specific goals (often it is not worth the extra complexity).
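To make the Dataflow 'update' path concrete: an update is a re-submission of compatible pipeline code with the same job name and the update option set, after which Dataflow transfers the in-flight state (open windows, partial aggregates) to the new job. A rough sketch of the runner options only (project, region, and job name are placeholders):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateJobLauncher {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-project");     // placeholder
    options.setRegion("us-central1");     // placeholder
    options.setJobName("event-counts");   // must match the name of the running job
    options.setUpdate(true);              // update that job in place, preserving its
                                          // internal (windowed) state

    Pipeline p = Pipeline.create(options);
    // ... re-apply the same (compatible) transforms as the running job ...
    p.run();
  }
}
```

The command-line equivalent is passing `--update` together with the existing `--jobName` when relaunching. This only helps while the original job still exists; once it is cancelled its internal state is gone, which is why exactly-once across two independent runs is not achievable.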
