
We use Airflow in a hybrid ETL system, meaning that some of our DAGs are not scheduled but are triggered externally through the Airflow API.

We are trying to do the following: have a sensor in a scheduled DAG (DAG1) detect that a task inside an externally triggered DAG (DAG2) has run.

For example, DAG1 runs at 11 am, and we want to be sure that DAG2 has run (due to an external trigger) at least once since 00:00. I have tried setting execution_delta=timedelta(hours=11), but the sensor detects nothing. I think the problem is that the sensor looks for a run whose execution date is exactly 00:00. That will not be the case, as DAG2 can be triggered at any time between 00:00 and 11:00.
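To illustrate the suspected mismatch: with execution_delta, the sensor looks for a single exact timestamp, namely its own logical date minus the delta. The short sketch below only reproduces that arithmetic in plain Python. The helper name sensed_date and both dates are made up for the example and are not part of Airflow.

```python
from datetime import datetime, timedelta

def sensed_date(logical_date, execution_delta):
    # With execution_delta, the sensor checks this single timestamp only
    return logical_date - execution_delta

# Made-up example values
dag1_logical_date = datetime(2022, 10, 17, 11, 0)
dag2_trigger_time = datetime(2022, 10, 17, 6, 42, 13)

target = sensed_date(dag1_logical_date, timedelta(hours=11))
print(target)                       # 2022-10-17 00:00:00
print(target == dag2_trigger_time)  # False
```

Since an externally triggered run almost never lands exactly on 00:00:00, the comparison fails and the sensor keeps waiting.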

Is there any solution that can serve the purpose we need? I think we might need to create a custom Sensor, but it feels strange to me that the native Airflow Sensor does not solve this issue.

This is the sensor I'm defining:

from datetime import timedelta
from airflow.sensors import external_task

sensor = external_task.ExternalTaskSensor(
    task_id='sensor',
    dag=dag,
    external_dag_id='DAG2',
    external_task_id='sensed_task',
    mode='reschedule',
    check_existence=True,
    execution_delta=timedelta(hours=int(execution_type)),
    poke_interval=10 * 60,  # Check every 10 minutes
    timeout=1 * 60 * 60,  # Allow for 1 hour of delay in execution
)

gontxomde

1 Answer


I had the same problem and used the execution_date_fn parameter:

ExternalTaskSensor(
    task_id="sensor",
    external_dag_id="dag_id",
    execution_date_fn=get_most_recent_dag_run,
    mode="reschedule",
)

where the get_most_recent_dag_run function looks like this:

from airflow.models import DagRun

def get_most_recent_dag_run(dt):
    # dt is the sensor's own logical date; it is not used here
    dag_runs = DagRun.find(dag_id="dag_id")
    # Sort so that the newest run comes first
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date

This works because, for cross-DAG dependencies, the ExternalTaskSensor needs to know both the dag_id and the exact execution date of the run it should wait for.

Nahid O.
  • I'm not sure I understand. For me, this configuration means the sensor never waits, since the external DAG's execution date always exists (as we take the most recent one). – qcha Oct 15 '22 at 14:02
  • In this configuration the sensor waits until the external DAG is in the 'success' state (the default) for the last execution date. If it's not in the 'success' state, the sensor is rescheduled until it meets the criteria. I think that's what the question is asking for. – Nahid O. Oct 17 '22 at 15:56
  • Thank you for the clarification about the state. But with your solution, the sensor that runs at 11 am will succeed if the latest external DAG run succeeded, even if it is X days old. We also need the latest external DAG run to have succeeded the same day. – qcha Oct 17 '22 at 17:11
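One possible way to meet the same-day requirement from the last comment is to keep only the runs that fall on the relevant day before picking the newest one. The sketch below shows just that selection logic in plain Python, so it can be checked without an Airflow installation. The helper latest_run_same_day is hypothetical, and falling back to midnight when no run qualifies is an assumption: the idea is to return a timestamp that is unlikely to match any run, so the sensor keeps waiting instead of receiving None.

```python
from datetime import datetime, timedelta, timezone

def latest_run_same_day(dt, run_dates):
    """Return the newest run date on dt's calendar day, else that day's midnight."""
    day_start = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    day_end = day_start + timedelta(days=1)
    # Keep only runs inside [day_start, day_end)
    same_day = [d for d in run_dates if day_start <= d < day_end]
    # Assumption: midnight matches no triggered run, so the sensor keeps poking
    return max(same_day, default=day_start)

# Made-up example values
utc = timezone.utc
dt = datetime(2022, 10, 17, 11, 0, tzinfo=utc)
runs = [
    datetime(2022, 10, 16, 23, 0, tzinfo=utc),  # previous day, ignored
    datetime(2022, 10, 17, 3, 0, tzinfo=utc),
    datetime(2022, 10, 17, 9, 30, tzinfo=utc),
]
print(latest_run_same_day(dt, runs))  # 2022-10-17 09:30:00+00:00
```

In Airflow, run_dates could be built from the execution_date of each run returned by DagRun.find(dag_id="DAG2"), as in the answer, with a one-argument wrapper passed as execution_date_fn. Note that the date Airflow passes to that function is the sensor's logical date, which for a scheduled DAG may not be the wall-clock day on which the run starts, so the window may need to be shifted accordingly.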