I've managed to use Google Cloud Scheduler to start a Dataflow pipeline on a schedule, but I also want each run to last at most an hour. Is it possible to schedule an end time for a Dataflow job?
Edit: I've created a pipeline that waits a certain amount of time and then cancels itself, but the cancel() line raises this error: IOError: Failed to get the Dataflow job id.
Here's the pipeline code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions(region='us-central1'))
(p
 | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
 | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
 | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
     '{0}:MarkTest.scraped'.format(PROJECT),
     schema=schema,
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
result = p.run()
# duration is in milliseconds: 3000 waits about 3 seconds (one hour would be 3600000)
result.wait_until_finish(duration=3000)
result.cancel()  # If the pipeline has not finished, cancel it
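For reference, the wait-then-cancel step on its own could be sketched as below. This is only a minimal sketch, not a fix for the IOError: cancel_after is a hypothetical helper, it works on any object exposing wait_until_finish(duration=...), cancel() and state, and the set of terminal state names is an assumption about the strings Beam's PipelineState uses.

```python
# Hypothetical helper; the terminal state names below are an assumption
# about the values exposed by apache_beam's PipelineState.
TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED", "UPDATED", "DRAINED"}


def cancel_after(result, timeout_ms):
    """Wait up to timeout_ms milliseconds, then cancel if still active.

    Returns True if cancel() was called, False if the job had already
    reached a terminal state by the time the wait returned.
    """
    # Blocks until the job finishes or the timeout elapses.
    result.wait_until_finish(duration=timeout_ms)

    # Skip cancel() when the job is already finished.
    if result.state in TERMINAL_STATES:
        return False

    result.cancel()
    return True
```

With the pipeline above, the intended call would be something like cancel_after(p.run(), 60 * 60 * 1000) for a one-hour limit. The process that calls it has to stay alive for the whole hour, which is part of why I'm asking whether Dataflow itself can enforce an end time.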