3

On Google Cloud Platform Dataflow, when streaming an unbounded PCollection (say, from a Pub/Sub topic using PubSubIO), is there an efficient way to start and stop the Beam pipeline on a schedule, for example starting it at the start of the day and stopping it at the end of the day? Is the only option a scheduler, such as a cron job in an App Engine service, that starts the pipeline job and later stops it? I am looking for any other options out there.

Also, if I apply windowing to the unbounded PCollection (say, from Pub/Sub), is there a way to write the files to a configurable directory, for example an hourly directory for each window? Currently I see that it creates one file per window.

Roshan Fernando

2 Answers

3

I agree with Pablo that Airflow (and Cloud Composer from the GCP side) is a good choice for the first part of your question.
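
Whichever scheduler ends up triggering the job (Airflow, Cloud Composer, or plain cron), the start and stop steps come down to two `gcloud dataflow jobs` invocations. The following is only a minimal sketch of how those commands could be assembled. The function names are hypothetical, the job name, region, and parameter values are placeholders, and it assumes the pipeline is launched from a template and that parameter values contain no commas.

```python
def build_start_command(job_name, template_path, region, parameters):
    """Build the argv for launching a templated Dataflow job.

    Assumption: the pipeline is available as a Dataflow template at
    template_path; parameters is a dict of template parameters.
    """
    # gcloud expects --parameters as comma-separated key=value pairs.
    joined = ",".join(f"{key}={value}" for key, value in sorted(parameters.items()))
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        f"--gcs-location={template_path}",
        f"--region={region}",
        f"--parameters={joined}",
    ]


def build_drain_command(job_id, region):
    """Build the argv for draining (gracefully stopping) a streaming job."""
    return ["gcloud", "dataflow", "jobs", "drain", job_id, f"--region={region}"]
```

A scheduler task could pass the returned list to `subprocess.run(..., check=True)` at the start of the day and again, with the drain command, at the end of the day. Draining lets in-flight data finish processing before the job stops, which is usually preferable to cancelling a streaming job.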

For the second part of your question, see the Google-provided Dataflow template for streaming from Cloud Pub/Sub to Google Cloud Storage files. You can create hourly directories by setting outputDirectory to gs:///YYYY/MM/DD/HH/ and the template will automatically replace YYYY, MM, DD and HH with the corresponding values of the interval window.

If you need to adapt this template to your specific needs, you can check the source code of the template.
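
To illustrate the hourly directory pattern described above, here is a rough sketch of the placeholder substitution in plain Python. It is not the template's own code: the function name and the bucket in the usage note are hypothetical, and which window timestamp and time zone the template actually uses is an assumption here (the sketch simply takes a UTC datetime supplied by the caller).

```python
from datetime import datetime, timezone


def resolve_directory(pattern, window_time):
    """Replace YYYY, MM, DD and HH in pattern with zero-padded values.

    Assumption: window_time is the timestamp chosen to represent the
    window (for example its start or end), expressed in UTC.
    """
    replacements = [
        ("YYYY", f"{window_time.year:04d}"),
        ("MM", f"{window_time.month:02d}"),
        ("DD", f"{window_time.day:02d}"),
        ("HH", f"{window_time.hour:02d}"),
    ]
    resolved = pattern
    for placeholder, value in replacements:
        resolved = resolved.replace(placeholder, value)
    return resolved
```

For example, calling `resolve_directory("gs://example-bucket/YYYY/MM/DD/HH/", datetime(2018, 11, 13, 2, tzinfo=timezone.utc))` returns `gs://example-bucket/2018/11/13/02/`, so every window that falls in the same hour lands in the same directory.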

Héctor Neri
  • Thanks, Cloud Composer looked like a good fit, but my organization's network policy doesn't allow creating a Cloud Composer environment with public IPs, and Cloud Composer currently has no option to restrict public IPs. – Roshan Fernando Nov 13 '18 at 02:49
  • 1
    For this you can use a proxy service and a GCE VM with a static IP, as described in [this answer](https://stackoverflow.com/a/50793805/9910124), plus a firewall rule on that instance's network to restrict IPs. Alternatively, since Apache Airflow is open source, you can also run it on its own. If you're interested, you may find further help with IT systems (rather than coding, which is the scope of Stack Overflow) on [Server Fault](https://serverfault.com/). – Héctor Neri Nov 16 '18 at 21:57
1

You should check out Apache Airflow (incubating), a new project donated by Airbnb that lets you schedule workflows, including Apache Beam pipelines.

Pablo