
I have a pipeline with a BigQuery table as the sink. I need to perform some steps immediately after data has been written to BigQuery. Those steps include running queries on that table, reading data from it, and writing the results to a different table.

How can I achieve this? Should I create a separate pipeline for the latter steps? But then calling it after the first pipeline finishes will be another problem, I assume.

If none of the above works, is it possible to call another Dataflow job (template) from a running pipeline?

Really need some help with this.

Thanks.

rish0097

2 Answers


This is currently not explicitly supported by BigQueryIO. The only workaround is to use separate pipelines: start the first pipeline, wait for it to finish (e.g. using `pipeline.run().waitUntilFinish()`), then start the second pipeline. Make sure to use a separate `Pipeline` object for the second one; reusing the same object multiple times is not supported.
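
A minimal sketch of this pattern with the Beam Java SDK. Everything besides the fresh `Pipeline` objects and `run().waitUntilFinish()` is a placeholder (table names, transforms), not a working implementation for your tables:

```java
// Sketch only: transforms and table names are placeholders.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SequentialPipelines {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // First pipeline: ends with the BigQueryIO.write() sink.
    Pipeline first = Pipeline.create(options);
    // first.apply(/* read + transforms */)
    //      .apply(BigQueryIO.writeTableRows().to("project:dataset.sink_table") /* ... */);
    first.run().waitUntilFinish(); // blocks until the BigQuery write has completed

    // Second pipeline: a *fresh* Pipeline object (do not reuse the first one)
    // that queries the table just written and writes to a different table.
    Pipeline second = Pipeline.create(options);
    // second.apply(BigQueryIO.readTableRows()
    //           .fromQuery("SELECT ... FROM `project.dataset.sink_table`")
    //           .usingStandardSql())
    //       .apply(/* transforms */)
    //       .apply(BigQueryIO.writeTableRows().to("project:dataset.other_table") /* ... */);
    second.run().waitUntilFinish();
  }
}
```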

jkff
  • Just to add - you wouldn't necessarily have to use another pipeline to achieve this. After the first pipeline finishes (`pipeline.run().waitUntilFinish()`), then you could just drop back into using the BigQuery SDK. We do this a lot in our pipelines and the pattern works well. https://stackoverflow.com/questions/44315157/perform-action-after-dataflow-pipeline-has-processed-all-data/44328850#44328850 – Graham Polley Oct 04 '17 at 02:53
  • @jkff How to make it work in case I'm creating templates? So will I have separate templates for the two pipelines? What if I wanted to create a single template that will run both pipelines? – rish0097 Oct 04 '17 at 09:13
  • This is unfortunately not possible with templates. – jkff Oct 04 '17 at 16:15
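
For illustration, a minimal sketch of the pattern Graham Polley describes above: once `waitUntilFinish()` returns, drop into the BigQuery Java client library to query the sink table and write to a different one. The query, dataset, and table names here are placeholders:

```java
// Sketch only: run after pipeline.run().waitUntilFinish() has returned.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class PostPipelineQuery {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Query the table the pipeline just wrote and store the result in a different table.
    QueryJobConfiguration config =
        QueryJobConfiguration.newBuilder("SELECT ... FROM `project.dataset.sink_table`")
            .setDestinationTable(TableId.of("dataset", "other_table"))
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .build();

    bigquery.query(config); // blocks until the query job completes
  }
}
```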

A workaround I have been using with templates is to write the result of the IO operation to a metadata file in a specific bucket; a Cloud Function (my orchestrator) gets triggered by that file and, in turn, starts the following pipeline. However, I have only tested this with TextIO operations. So, in your case:

  • Perform the BigQueryIO.write() operation
  • Write its result to a file (xxx-meta-file) in a Cloud Storage bucket (xxx-meta-bucket) where you keep only Dataflow results; this is the last step of your pipeline
  • Write an orchestrator Cloud Function that listens for created/modified objects in xxx-meta-bucket (see here)
  • In the orchestrator, you will likely need some condition to check which file was actually created/modified
  • Trigger the next pipeline accordingly, either directly in the orchestrator or decoupled by triggering another Cloud Function responsible for starting that specific pipeline (a sketch of such an orchestrator follows below)

I'm pretty sure a similar approach can easily be replicated using Pub/Sub instead of writing to buckets (e.g. see here for the second step in my list).
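
A minimal sketch of the orchestrator in Java, assuming a background Cloud Function triggered by object-finalize events on xxx-meta-bucket. The xxx-meta-file name check and the launchNextPipeline() call are hypothetical; the latter stands in for whatever mechanism you use to start the next job (e.g. the Dataflow templates.launch REST API or another Cloud Function):

```java
// Sketch only: background Cloud Function (Java Functions Framework) triggered by
// GCS object creation events on xxx-meta-bucket.
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.util.logging.Logger;

public class Orchestrator implements BackgroundFunction<Orchestrator.GcsEvent> {
  private static final Logger logger = Logger.getLogger(Orchestrator.class.getName());

  // Minimal POJO for the fields of the GCS event payload we care about.
  public static class GcsEvent {
    public String bucket;
    public String name;
  }

  @Override
  public void accept(GcsEvent event, Context context) {
    // Only react to the metadata file written as the last step of the first pipeline.
    if (!"xxx-meta-file".equals(event.name)) {
      logger.info("Ignoring object: " + event.name);
      return;
    }
    logger.info("Metadata file found in " + event.bucket + ", starting the next pipeline");
    // Hypothetical helper: launch the next Dataflow job (template), or trigger
    // another Cloud Function dedicated to starting that specific pipeline.
    // launchNextPipeline();
  }
}
```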

saccodd