I am new to Apache Beam and am exploring the Python SDK for Dataflow. I want to execute my pipeline steps in a certain order, but Beam executes all of them in parallel. How do I create a dependency between tasks in Apache Beam Python?
Sample code (sample.json contains 5 rows):
import apache_beam as beam
import logging
from apache_beam.options.pipeline_options import PipelineOptions

class Sample(beam.PTransform):
    def __init__(self, index):
        self.index = index

    def expand(self, pcoll):
        logging.info(self.index)
        return pcoll

class LoadData(beam.DoFn):
    def process(self, context):
        logging.info("***")

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    pipeline = beam.Pipeline(options=PipelineOptions())
    (pipeline
     | "one" >> Sample(1)
     | "two: Read" >> beam.io.ReadFromText('sample.json')
     | "three: show" >> beam.ParDo(LoadData())
     | "four: sample2" >> Sample(2)
    )
    pipeline.run().wait_until_finish()
I expected it to execute in the order one, two, three, four, but it runs in parallel.
Output of the above code:
INFO:root:Missing pipeline option (runner). Executing pipeline using the
default runner: DirectRunner.
INFO:root:1
INFO:root:2
INFO:root:Running pipeline with DirectRunner.
INFO:root:***
INFO:root:***
INFO:root:***
INFO:root:***
INFO:root:***
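One pattern I came across while searching is to feed the output of the earlier step into the later step as a side input, so the later step cannot start until the earlier PCollection has been fully computed. Below is a minimal sketch of that idea; the step names and the Create data are made up for illustration, not from my real pipeline:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Step that (hopefully) has to finish first; the data is illustrative.
    first = (pipeline
             | "make first" >> beam.Create([1, 2, 3])
             | "sum first" >> beam.CombineGlobally(sum))

    # Passing `first` as a side input should force this step to wait
    # until the whole `first` PCollection is available.
    second = (pipeline
              | "make second" >> beam.Create(["a", "b", "c"])
              | "wait on first" >> beam.Map(
                  lambda element, unused: element,
                  unused=beam.pvalue.AsIter(first)))

Is a side input like this the intended way to express a dependency between steps, or is there a more direct mechanism for ordering transforms?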