
I've got a DAG that's scheduled to run daily. Normally, the scheduler would trigger a run as soon as the execution_date's period is complete, i.e., the next day. However, due to upstream delays, I only want to kick off the DAG run for a given execution_date three days after that execution_date. In other words, I want to introduce a three-day lag.

From the research I've done, one route would be to add a `TimeDeltaSensor` at the beginning of my DAG with `delta=datetime.timedelta(days=3)`.

However, due to the way the Airflow scheduler is implemented, that's problematic. Under this approach, each of my DAG runs will be active for over three days. My DAG has lots of tasks, and if several DAG runs are active at once, I've noticed that the scheduler eats up lots of CPU because it's constantly iterating over all of those tasks (even ones that aren't doing anything). So is there another way to just tell the scheduler not to kick off the DAG run until three days have passed?
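For reference, the sensor-based approach I'm describing looks roughly like this (a sketch; the DAG id and task names are hypothetical, and the import paths are from Airflow 1.x):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import TimeDeltaSensor

dag = DAG(
    dag_id="daily_with_lag",          # hypothetical name
    schedule_interval="@daily",
    start_date=datetime(2018, 3, 1),
)

# Blocks until three days after the end of the schedule interval;
# the DAG run stays active (and scheduler-visible) the whole time.
wait = TimeDeltaSensor(
    task_id="wait_three_days",
    delta=timedelta(days=3),
    dag=dag,
)

process = DummyOperator(task_id="process", dag=dag)

wait >> process
```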

conradlee

2 Answers


It might be easier to manipulate the date variable within the DAG.

I am assuming you use the execution date `ds` in your task instances in some way, e.g. to query data for the given day.

In this case you could use the built-in macros to manipulate the date, e.g. `macros.ds_add(ds, -3)` to shift the date back three days.

You can use it in a template field as usual: `'{{ macros.ds_add(ds, -3) }}'`
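To illustrate what `macros.ds_add` computes, here's a plain-Python equivalent (the helper name `ds_add` mirrors the macro, and the dates are just examples):

```python
from datetime import datetime, timedelta

def ds_add(ds, days):
    """Plain-Python equivalent of Airflow's macros.ds_add:
    shift a YYYY-MM-DD date string by the given number of days."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")

print(ds_add("2018-03-20", -3))  # 2018-03-17
```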

Macro docs here

Blakey
  • In one sense this would work: it'll prevent the scheduler from processing a DAG run for three days, and it'll also add the desired delay. However, it'll lead to confusion: people in my org who view DAG runs in the Airflow UI or who run manual DAG runs from the CLI will need to know that the DAG run for `2017-12-04` effectively sets the execution date to `2017-12-01`. That'll confuse people--lots of Airflow users are already confused by which day's data is processed for a given execution date, and this makes it even more confusing. – conradlee Mar 20 '18 at 07:10
  • Yes, it's not elegant, that's for sure; just offering potential options. My preferred route would always be sensors where possible, and generally file or data sensors as opposed to time ones. I.e. I want to know the data I need is actually available before anything runs. – Blakey Mar 20 '18 at 17:58

One possible solution could be to set `max_active_runs` to 1 for the DAG. While this does not prevent each DAG run from being active for three days, it would prevent multiple DAG runs from being initiated at once.
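Concretely, this is just a DAG-level argument (a sketch; the DAG id is hypothetical):

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="daily_with_lag",          # hypothetical name
    schedule_interval="@daily",
    start_date=datetime(2018, 3, 1),
    max_active_runs=1,                # never more than one active DAG run
)
```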

Mask
  • This seems like the best workaround so far, because it preserves the documented notion of `execution date`-- if someone looks at a DAG run from `2017-04-01` in the UI, then it'll actually be the DAG run that processes that day's data. However, it seems like a workaround rather than the intended way of handling this situation. Isn't there a more officially supported way of handling this use-case, which seems quite common to me? – conradlee Mar 20 '18 at 07:12