
I have a Dataflow job that transforms data and writes out to BigQuery (batch job). Following the completion of the write operation I want to send a message to PubSub which will trigger further processing of the data in BigQuery. I have seen a few older questions/answers that hint at this being possible but only on streaming jobs:

I'm wondering if this is supported in any way for batch write jobs now? I can't use Apache Airflow to orchestrate all this, unfortunately, so sending a Pub/Sub message seemed like the easiest way.

blablabla

1 Answer


Beam's design makes it impossible to do what you want inside the pipeline itself. You write a PCollection to BigQuery, and by definition a PCollection is a bounded or unbounded collection. How can you trigger something after an unbounded collection? When do you know that you have reached the end?

So, you have a few ways to achieve this. In your code, you can wait for pipeline completion and then publish a Pub/Sub message.
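In the launcher that can look like the sketch below. The project, topic, job, and table names are hypothetical, and it assumes the `apache-beam` and `google-cloud-pubsub` packages; only the payload helper is plain Python:

```python
import json

def build_done_message(job_name: str, table: str) -> bytes:
    """Build the Pub/Sub payload announcing that the batch write finished."""
    return json.dumps({"job": job_name, "table": table, "status": "DONE"}).encode("utf-8")

# In the launcher process (outside the pipeline graph), after building the
# pipeline that writes to BigQuery:
#
#     result = pipeline.run()
#     result.wait_until_finish()  # blocks until the batch job completes
#
#     from google.cloud import pubsub_v1
#     publisher = pubsub_v1.PublisherClient()
#     topic = publisher.topic_path("my-project", "bq-write-done")  # hypothetical names
#     publisher.publish(topic, build_done_message("my-job", "dataset.table")).result()
```

Note that `wait_until_finish()` only blocks in the process that launched the job, which is exactly why this doesn't work for a templated job (see the comments below).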

Personally, I prefer to base this on the logs: when the Dataflow job finishes, I pick up the end-of-job log entry and sink it into Pub/Sub. That decouples the pipeline code from the next step.
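One way to wire that up is with a Cloud Logging sink, roughly as follows. This is a sketch; the topic and sink names are assumptions, and the log filter in particular should be checked against your own job's logs, since the exact completion message can vary by SDK version:

```shell
# Topic that downstream processing subscribes to (hypothetical name).
gcloud pubsub topics create dataflow-job-done

# Route the end-of-job Dataflow log entries into that topic.
# The filter is an assumption; inspect your job's entries in Cloud Logging
# to find a reliable completion message for your SDK version.
gcloud logging sinks create dataflow-done-sink \
  pubsub.googleapis.com/projects/MY_PROJECT/topics/dataflow-job-done \
  --log-filter='resource.type="dataflow_step" AND textPayload:"Worker pool stopped."'

# Grant the sink's writer identity permission to publish to the topic.
gcloud pubsub topics add-iam-policy-binding dataflow-job-done \
  --member="$(gcloud logging sinks describe dataflow-done-sink --format='value(writerIdentity)')" \
  --role="roles/pubsub.publisher"
```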

You can also have a look at Workflows. It's not really mature yet, but very promising for a simple workflow like yours.
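A minimal Workflows sketch of that pattern might look like this; the launch-and-poll step is elided, and the project and topic names are assumptions:

```yaml
main:
  steps:
    # ... launch the Dataflow template, then poll the job until its state
    # is JOB_STATE_DONE (elided) ...
    - notify:
        call: googleapis.pubsub.v1.projects.topics.publish
        args:
          topic: projects/MY_PROJECT/topics/dataflow-job-done
          body:
            messages:
              - data: ${base64.encode(json.encode({"status": "DONE"}))}
```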

guillaume blaquiere
  • Thanks @guillaume for your reply. In this case the collection is bounded so there is a definitive endpoint for the write operation. In examples I saw like the link below it looked like it might be possible to send a pubsub notification on completion of the write operation but I think I read that is only supported for streaming inserts and I wondered if that had changed. https://stackoverflow.com/questions/51085326/how-to-notify-when-dataflow-job-is-complete/52598451#52598451 – blablabla Sep 11 '20 at 16:08
  • Regarding waiting for pipeline completion and then publishing a PubSub message: I didn't think that would be possible for a templated Dataflow job. I thought the job would automatically finish as soon as the pipeline completes. Is that assumption incorrect? – blablabla Sep 11 '20 at 16:10
  • Correct, for a template, only the pipeline is captured, not what you write afterward. – guillaume blaquiere Sep 11 '20 at 18:49