
I have an Airflow DAG written as below:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

with DAG(
    dag_id='dag',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
    },
    dagrun_timeout=timedelta(hours=2),
    start_date=datetime(2021, 9, 28, 11),
    schedule_interval='10 * * * *'
) as dag:

    create_job_flow = EmrCreateJobFlowOperator(
        task_id='create_job_flow',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id='aws_default',
        emr_conn_id='emr_default',
    )

    job_step = EmrAddStepsOperator(
        task_id='job_step',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
        steps=JOB_SETP,
    )

    job_step_sensor = EmrStepSensor(
        task_id='job_step_sensor',
        job_flow_id=create_job_flow.output,
        step_id="{{ task_instance.xcom_pull(task_ids='job_step', key='return_value')[0] }}",
        aws_conn_id='aws_default',
    )

    read_file = PythonOperator(
        task_id="read_file",
        python_callable=get_date_information
    )

    alter_partitions = PythonOperator(
        task_id="alter_partitions",
        python_callable=update_partitions
    )

    remove_cluster = EmrTerminateJobFlowOperator(
        task_id='remove_cluster',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
    )


    create_job_flow.set_downstream(job_step)
    job_step.set_downstream(job_step_sensor)
    job_step_sensor.set_downstream(read_file)
    read_file.set_downstream(alter_partitions)
    alter_partitions.set_downstream(remove_cluster)

So this basically creates an EMR cluster, starts a step on it, and senses that step. It then executes some Python functions and finally terminates the cluster. The DAG graph view in the Airflow UI looks like this:

[screenshot of the DAG graph view]

Here, create_job_flow also points to remove_cluster (maybe because job_flow_id holds a reference to create_job_flow), whereas I only set remove_cluster as the downstream of alter_partitions. Could it happen that the cluster is removed before job_step is even reached? In that case the cluster would already be deleted before the step executes, which would be the problem. Is there any way to remove the link between create_job_flow and remove_cluster? Or will it wait for alter_partitions to finish and only then execute remove_cluster?

seou1

1 Answer


The "remove_cluster" task will wait until the "alter_partitions" task is completed. The extra edge between "create_job_flow" and "remove_cluster" (as well as between "create_job_flow" and "job_step_sensor") is a feature of the TaskFlow API and the XComArg concept, namely the use of an operator's .output property. (Check out this documentation for another example.)

In both the "remove_cluster" and "job_step_sensor" tasks, job_flow_id=create_job_flow.output is an input arg. Behind the scenes, when an operator's .output is used in a templated field of another task, a dependency is automatically created. This feature makes explicit the task dependencies that were previously only implicit whenever one task consumed another task's XComs.
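For example, here is a minimal, self-contained sketch (the DAG, task names, and callable are made up for illustration) showing that merely referencing one operator's .output in another task's templated field wires up the dependency:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def consume(value):
    print(f"received: {value}")


with DAG(dag_id="xcomarg_demo", start_date=datetime(2021, 9, 28), schedule_interval=None) as demo_dag:

    producer = BashOperator(task_id="producer", bash_command="echo 42")

    consumer = PythonOperator(
        task_id="consumer",
        python_callable=consume,
        # producer.output is an XComArg; using it in a templated field (op_args)
        # both pulls the XCom value at runtime and creates producer >> consumer automatically.
        op_args=[producer.output],
    )
    # No explicit set_downstream()/>> call is needed for this edge.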

This pipeline will execute serially as written and desired (assuming the trigger_rule is "all_success" which is the default). The "remove_cluster" task won't execute until both the "create_job_flow" and "alter_partitions" tasks are complete (which is effectively a serial execution).
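In other words, the explicit set_downstream() calls plus the automatic edges from .output amount to something like the following sketch of the dependency graph:

create_job_flow >> job_step >> job_step_sensor >> read_file >> alter_partitions >> remove_cluster

# Implicit extra edges created by consuming create_job_flow.output:
create_job_flow >> job_step_sensor
create_job_flow >> remove_cluster

Since create_job_flow is already upstream of everything through the main chain, the two extra edges don't change the execution order; remove_cluster still runs only after alter_partitions succeeds.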

Josh Fell