
Problem

Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by them (correct?).

As we run more of these tasks, we would need a bigger Cloud Composer cluster (in our case) just to execute tasks that are essentially blocking when they shouldn't be (they should be async operations).

Options

  • Option 1: just launch the job and mark the Airflow task as successful
  • Option 2: write a wrapper as explained here and use reschedule mode as explained here

Option 1 does not seem feasible, as the DataflowTemplateOperator only has an option, poll_sleep, to specify the wait time between completion checks (source).

For the DataflowCreateJavaJobOperator there is an option, check_if_running, to wait for completion of a previous job with the same name (see this code).

It seems that after launching a job, the wait_for_finish check is executed (see this line), which boils down to polling while the job is "incomplete" (see this line).

For Option 2, I need Option 1.

Questions

  1. Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
  2. Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
  3. Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that checks whether the job finished correctly, but in reschedule mode (see the sketch after this list).
  4. Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
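
For reference, this is roughly what I have in mind for question 3: a rough sketch with hypothetical project/bucket/job names, using gcloud for the fire-and-forget launch and a PythonSensor in reschedule mode for the completion check (it glosses over details such as telling apart multiple runs with the same job name):

from airflow.operators.bash import BashOperator
from airflow.sensors.python import PythonSensor
from googleapiclient.discovery import build

PROJECT_ID = "my-project"          # hypothetical values
REGION = "europe-west1"
JOB_NAME = "my-templated-job"

# Fire-and-forget: gcloud returns as soon as the Dataflow job is created.
launch = BashOperator(
    task_id="launch_dataflow_job",
    bash_command=(
        f"gcloud dataflow jobs run {JOB_NAME} "
        "--gcs-location gs://my-bucket/templates/my-template "
        f"--region {REGION}"
    ),
)

def _job_is_done():
    # Look up jobs with our name and check whether one has finished.
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    jobs = (
        dataflow.projects()
        .locations()
        .jobs()
        .list(projectId=PROJECT_ID, location=REGION)
        .execute()
        .get("jobs", [])
    )
    return any(
        j["name"] == JOB_NAME and j["currentState"] == "JOB_STATE_DONE"
        for j in jobs
    )

# reschedule mode frees the worker slot between pokes instead of holding it.
check = PythonSensor(
    task_id="wait_for_dataflow_job",
    python_callable=_job_is_done,
    mode="reschedule",
    poke_interval=300,
)

launch >> check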
Jonny5

2 Answers


Answers

  1. Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
    A: Partly, yes. Airflow has a parallelism option in its configuration which defines the number of tasks that can execute at a time across the system. Having a task block one of these slots can slow down execution in the system, but this issue is bound to happen anyway as you increase the number of tasks and DAGs. You can increase this value in the configuration depending on your needs.

  2. Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
    A: Yes. You can use a PythonOperator and, in the python_callable, use the Dataflow hook to launch the job in async mode (launch and don't wait); see the sketch after this list.

  3. Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that checks whether the job finished correctly, but in reschedule mode.
    A: When you say reschedule, I'm assuming that you are going to retry the task that checks whether the job finished correctly. If that's right, you can set that task to retry and configure the delay at which you want the retries to happen.

  4. Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
    A: I think I answered this in the second question.
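
To make that concrete, here is a minimal sketch of the async launch, assuming a classic template stored in GCS and hypothetical PROJECT_ID / REGION / TEMPLATE_PATH values. Instead of the hook it goes through the Dataflow templates.launch REST endpoint via the Google API client (DataflowHook wraps the same API) inside a PythonOperator: it returns as soon as the job is created and pushes the job ID to XCom so a later task can check on it.

from airflow.operators.python import PythonOperator
from googleapiclient.discovery import build

PROJECT_ID = "my-project"                                # hypothetical values
REGION = "europe-west1"
TEMPLATE_PATH = "gs://my-bucket/templates/my-template"

def launch_dataflow_template(**context):
    # Build the Dataflow API client (uses Application Default Credentials).
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    response = (
        dataflow.projects()
        .locations()
        .templates()
        .launch(
            projectId=PROJECT_ID,
            location=REGION,
            gcsPath=TEMPLATE_PATH,
            body={
                "jobName": f"my-job-{context['ds_nodash']}",
                "parameters": {},
            },
        )
        .execute()
    )
    # templates.launch returns as soon as the job is created; return the
    # job id so it lands in XCom and a downstream task can poll its state.
    return response["job"]["id"]

launch_job = PythonOperator(
    task_id="launch_dataflow_job",
    python_callable=launch_dataflow_template,
)

A downstream check task (for example a sensor in mode="reschedule") can then pull that job id from XCom and read its state via dataflow.projects().locations().jobs().get(...).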
Tameem
  • Thanks! Any idea how you launch the task in async mode? – Jonny5 Jan 27 '20 at 21:53
  • Sync or async totally depends on your dependency flow... When you add a dependency on the Dataflow template operator, the downstream tasks will always depend on that operator... If you do not want this, you have to design the dependencies so that jobs are launched in async mode and Airflow tasks are not blocked... – Tameem Feb 02 '20 at 04:17
  • If waiting too long in the Dataflow template operator is causing you a problem, you can create your own Dataflow template operator that fails the task when the Dataflow job is not yet in one of its finalized states (JOB_STATE_DONE or JOB_STATE_FAILED) and marks it for retry after a certain delay... This can also ensure that the tasks are not blocked while you keep the tasks in sync mode... – Tameem Feb 02 '20 at 04:21

The question and the answers are a little old, but they still pop up in search results. To complete them: it is possible to avoid blocking with the native operators and sensors. According to the Apache Airflow GCP Dataflow provider documentation:

Dataflow has multiple options of executing pipelines. It can be done in the following modes: batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely). In Airflow it is best practice to use asynchronous batch pipelines or streams and use sensors to listen for expected job state.

You should be able to use the native operators and sensors for the highlighted (async) execution, as shown in this example:

# Imports needed for this excerpt (from the apache-beam provider package);
# GCS_PYTHON_SCRIPT, GCS_OUTPUT and LOCATION are defined elsewhere in the example DAG.
from airflow.providers.apache.beam.hooks.beam import BeamRunnerType
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

# [START howto_operator_start_python_job_async]
start_python_job_async = BeamRunPythonPipelineOperator(
    task_id="start_python_job_async",
    runner=BeamRunnerType.DataflowRunner,
    py_file=GCS_PYTHON_SCRIPT,
    py_options=[],
    pipeline_options={
        "output": GCS_OUTPUT,
    },
    py_requirements=["apache-beam[gcp]==2.46.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
    dataflow_config={
        "job_name": "start_python_job_async",
        "location": LOCATION,
        # Don't block the Airflow task until the Dataflow job completes.
        "wait_until_finished": False,
    },
)
# [END howto_operator_start_python_job_async]

# Imports needed for this excerpt (from the google provider package):
from airflow.providers.google.cloud.hooks.dataflow import DataflowJobStatus
from airflow.providers.google.cloud.sensors.dataflow import DataflowJobStatusSensor

# [START howto_sensor_wait_for_job_status]
wait_for_python_job_async_done = DataflowJobStatusSensor(
    task_id="wait_for_python_job_async_done",
    job_id="{{task_instance.xcom_pull('start_python_job_async')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    location=LOCATION,
)
# [END howto_sensor_wait_for_job_status]
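
Wiring is then just start_python_job_async >> wait_for_python_job_async_done. If you also want the sensor to free its worker slot between checks, it should accept mode="reschedule" like any other sensor.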
Bartosz Konieczny