I've managed to use Google Cloud Scheduler to schedule a Dataflow pipeline run, but I also want the pipeline to run for at most an hour. Is it possible to schedule an end time for a Dataflow job?

Edit: I've created a pipeline that waits a certain amount of time and then cancels itself, but the cancel() line fails with this error: IOError: Failed to get the Dataflow job id.

Here's the pipeline code:

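import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# TOPIC, PROJECT and schema are defined elsewhere in the script.
# streaming=True because Pub/Sub is an unbounded source.
p = beam.Pipeline(options=PipelineOptions(region='us-central1', streaming=True))

(p
    | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
    | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('{0}:MarkTest.scraped'.format(PROJECT), schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
result = p.run()
# duration is in milliseconds, so one hour is 60 * 60 * 1000 (3000 would be only 3 seconds)
result.wait_until_finish(duration=60 * 60 * 1000)

result.cancel()   # cancel the job if it is still running after the timeout; this is the line that raises the IOError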
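
The error message suggests that the result object has no Dataflow job attached at the point where cancel() is called. A minimal sketch of the wait-then-cancel pattern with a guard for that case follows. It is not a confirmed fix. The helper cancel_after, the FakeResult stand-in and the TERMINAL_STATES set are hypothetical names introduced here for illustration, and the has_job check assumes the Dataflow result object exposes such an attribute (results without it are treated as having a job).

```python
# Names of Beam's terminal pipeline states, written as plain strings so the
# sketch runs without apache_beam installed (assumption: they match
# PipelineState.DONE, FAILED, CANCELLED, UPDATED and DRAINED).
TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED", "UPDATED", "DRAINED"}


def cancel_after(result, max_runtime_ms):
    """Wait up to max_runtime_ms, then cancel the job if it is still active.

    Returns "finished", "cancelled" or "no_job".
    """
    # has_job is read defensively: results that lack the attribute are
    # assumed to have a job attached.
    if not getattr(result, "has_job", True):
        return "no_job"  # nothing to wait for or cancel

    result.wait_until_finish(duration=max_runtime_ms)  # milliseconds
    if result.state in TERMINAL_STATES:
        return "finished"

    result.cancel()
    return "cancelled"


class FakeResult:
    """Hypothetical stand-in for a pipeline result, for illustration only."""

    def __init__(self, state, has_job=True):
        self.state = state
        self.has_job = has_job

    def wait_until_finish(self, duration=None):
        return self.state  # pretend the timeout elapsed immediately

    def cancel(self):
        self.state = "CANCELLED"
        return self.state


if __name__ == "__main__":
    one_hour_ms = 60 * 60 * 1000
    print(cancel_after(FakeResult("RUNNING"), one_hour_ms))
    print(cancel_after(FakeResult("DONE"), one_hour_ms))
    print(cancel_after(FakeResult("UNKNOWN", has_job=False), one_hour_ms))
```

In the real script this would be called as cancel_after(p.run(), 60 * 60 * 1000). A "no_job" return would indicate that p.run() did not submit a job from this process, which is worth checking before looking at the cancel logic itself.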
