
I am new to Apache Beam and am exploring the Python version of Apache Beam on Dataflow. I want to execute my Dataflow tasks in a certain order, but it executes all tasks in parallel. How do I create a task dependency in Apache Beam Python?

Sample code (in the code below, the sample.json file contains 5 rows):

import apache_beam as beam
import logging
from apache_beam.options.pipeline_options import PipelineOptions

class Sample(beam.PTransform):
    def __init__(self, index):
        self.index = index

    def expand(self, pcoll):
        logging.info(self.index)
        return pcoll

class LoadData(beam.DoFn):
    def process(self, element):
        logging.info("***")

if __name__ == '__main__':

    logging.getLogger().setLevel(logging.INFO)
    pipeline = beam.Pipeline(options=PipelineOptions())

    (pipeline
        | "one" >> Sample(1)
        | "two: Read" >> beam.io.ReadFromText('sample.json')
        | "three: show" >> beam.ParDo(LoadData())
        | "four: sample2" >> Sample(2)
    )
    pipeline.run().wait_until_finish()

I expected it to execute in the order one, two, three, four, but it runs everything in parallel.

Output of the above code:

INFO:root:Missing pipeline option (runner). Executing pipeline using the 
default runner: DirectRunner.
INFO:root:1
INFO:root:2
INFO:root:Running pipeline with DirectRunner.
INFO:root:***
INFO:root:***
INFO:root:***
INFO:root:***
INFO:root:***
MJK
  • What are you trying to accomplish by executing this in sequence? Also, I'm not sure what your "Sample" transform is supposed to do: as implemented, it does nothing. Also keep in mind that, much like a database query plan, a pipeline is first constructed (that's when you're seeing the logging from expand()), and then optimized by the runner and executed (that's when you're seeing "***"). – jkff Mar 17 '18 at 18:23
  • @jkff I want to load data from BigQuery to Elasticsearch. In my Sample transform, I am doing operations like creating, reindexing, and deleting Elasticsearch indexes. So first I need to create a temp index, second load the data into the ES temp index, third reindex it, and fourth delete the temp index. I want to execute all these tasks in an ordered manner, but here the creating, reindexing, and deleting tasks are executed first and the data load runs last (you can see the "***" logs appear at the end). – MJK Mar 18 '18 at 06:08

1 Answer


As per Dataflow's documentation:

When the pipeline runner builds your actual pipeline for distributed execution, the pipeline may be optimized. For example, it may be more computationally efficient to run certain transforms together, or in a different order. The Dataflow service fully manages this aspect of your pipeline's execution.

Also as per Apache Beam's documentation:

The APIs emphasize processing elements in parallel, which makes it difficult to express actions like “assign a sequence number to each element in a PCollection”. This is intentional as such algorithms are much more likely to suffer from scalability problems. Processing all elements in parallel also has some drawbacks. Specifically, it makes it impossible to batch any operations, such as writing elements to a sink or checkpointing progress during processing.

So the thing is that Dataflow and Apache Beam are parallel by nature; they were designed to handle embarrassingly parallel use cases and may not be the best tool if you require operations to be executed in a specific order. As @jkff pointed out, Dataflow will optimize the pipeline in such a way that it parallelizes the operations in the best possible way.

If you really need to execute each of the steps in consecutive order, the workaround is to use blocking execution instead, via the waitUntilFinish() method (wait_until_finish() in the Python SDK), as explained in this other Stack Overflow answer. However, my understanding is that such an implementation would only work in a batch pipeline, since a streaming pipeline consumes data continuously and you therefore cannot block the execution to work on consecutive steps.
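
To make that workaround concrete, below is a minimal sketch in Python under the same assumptions as the question (a local sample.json, batch execution). The Elasticsearch steps (create_temp_index, reindex_and_cleanup) are hypothetical stubs standing in for whatever client code you already use; only the Beam calls are real APIs. Because wait_until_finish() blocks, each step starts only after the previous one has completed:

import apache_beam as beam
import logging
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical placeholders: in a real job these would call the Elasticsearch client.
def create_temp_index():
    logging.info("1: create temp index")

def reindex_and_cleanup():
    logging.info("3: reindex, 4: delete temp index")

def load_data():
    # The only step that needs Beam: a batch pipeline that loads the rows.
    pipeline = beam.Pipeline(options=PipelineOptions())
    (pipeline
        | "Read" >> beam.io.ReadFromText('sample.json')
        | "Load" >> beam.Map(lambda row: logging.info("2: load %s", row)))
    # wait_until_finish() blocks, so nothing after this call starts
    # until the whole pipeline has completed.
    pipeline.run().wait_until_finish()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    create_temp_index()      # runs first
    load_data()              # runs second, blocks until the pipeline finishes
    reindex_and_cleanup()    # runs last

This keeps the non-parallel index management outside the pipeline and reserves the pipeline itself for the embarrassingly parallel part (reading and loading the rows).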

dsesto