I have an Airflow DAG written as below:
with DAG(
    dag_id='dag',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
    },
    dagrun_timeout=timedelta(hours=2),
    start_date=datetime(2021, 9, 28, 11),
    schedule_interval='10 * * * *'
) as dag:
    create_job_flow = EmrCreateJobFlowOperator(
        task_id='create_job_flow',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id='aws_default',
        emr_conn_id='emr_default',
    )
    job_step = EmrAddStepsOperator(
        task_id='job_step',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
        steps=JOB_SETP,
    )
    job_step_sensor = EmrStepSensor(
        task_id='job_step_sensor',
        job_flow_id=create_job_flow.output,
        step_id="{{ task_instance.xcom_pull(task_ids='job_step', key='return_value')[0] }}",
        aws_conn_id='aws_default',
    )
    read_file = PythonOperator(
        task_id="read_file",
        python_callable=get_date_information
    )
    alter_partitions = PythonOperator(
        task_id="alter_partitions",
        python_callable=update_partitions
    )
    remove_cluster = EmrTerminateJobFlowOperator(
        task_id='remove_cluster',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
    )
    create_job_flow.set_downstream(job_step)
    job_step.set_downstream(job_step_sensor)
    job_step_sensor.set_downstream(read_file)
    read_file.set_downstream(alter_partitions)
    alter_partitions.set_downstream(remove_cluster)
So this basically creates an EMR cluster, starts a step on it, and waits for that step with a sensor. It then executes some Python functions and finally terminates the cluster. The view of the DAG in the Airflow UI is as below:
Here, create_job_flow also points to remove_cluster (maybe because job_flow_id references create_job_flow.output), even though I only set remove_cluster as downstream of alter_partitions. Could remove_cluster run before job_step is reached? In that case the cluster would already be terminated before job_step executes, which would be a problem. Is there any way to remove the link between create_job_flow and remove_cluster? Or will remove_cluster wait for alter_partitions to finish before it executes?
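To make the concern concrete, here is a minimal sketch in plain Python (no Airflow involved) of the edges in this DAG. It rests on two assumptions: that passing `create_job_flow.output` to an operator adds create_job_flow as an upstream task of that operator, and that every task uses the default `all_success` trigger rule, under which a task starts only after all of its upstream tasks have succeeded. The `upstream` dictionary and the `run_order` helper are hypothetical names used only for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Upstream tasks per task. The create_job_flow entries on job_step_sensor
# and remove_cluster are the assumed implicit edges from `.output`;
# the rest come from the explicit set_downstream() calls.
upstream = {
    "create_job_flow": set(),
    "job_step": {"create_job_flow"},
    "job_step_sensor": {"job_step", "create_job_flow"},
    "read_file": {"job_step_sensor"},
    "alter_partitions": {"read_file"},
    "remove_cluster": {"alter_partitions", "create_job_flow"},
}


def run_order(upstream):
    """Return an order in which each task comes after all of its upstream tasks."""
    return list(TopologicalSorter(upstream).static_order())


order = run_order(upstream)
print(order)
```

Under those assumptions, remove_cluster can only appear after both create_job_flow and alter_partitions, so the extra edge would not by itself let the cluster be terminated early. This models only the ordering constraint, not Airflow's scheduler or failure handling.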