I have an Airflow DAG scheduled to run daily. When I start a backfill for the last month, Airflow processes the runs from oldest to newest. A single run takes a couple of hours, which means that when a new run becomes available (because a day has passed while working through the backfill), that new run will only be processed after the entire backfill has completed (leaving recent data unavailable to the company). Is it possible to instruct Airflow to process runs from most recent to oldest?
6 Answers
You can do it in Airflow 1.10.3
https://airflow.apache.org/cli.html#backfill
airflow backfill --run_backwards dag_id
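For a month-long backfill you would also pass a start and end date; a minimal sketch (the DAG id my_daily_dag and the dates are placeholders):

# run the newest execution date first, then work backwards towards the start date
airflow backfill -s 2019-03-01 -e 2019-03-31 --run_backwards my_daily_dag

Note that --run_backwards cannot be combined with tasks that use depends_on_past=True.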

There is a feature request that is marked as resolved.
From the ticket details, it looks like this will be available from Airflow 1.10.3. As of this writing it has yet to be released, but presumably will be shortly.
The usage is indicated in the ticket comments:
Create backfill dag runs in reverse order by setting `backfill_dagrun_order_reverse = True` under the `scheduler` section.
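For reference, this is roughly where the proposed setting would sit in airflow.cfg; note that, per the comments below, this option was never actually merged, so this is only a sketch of what the ticket described:

[scheduler]
# proposed in the ticket but never released; a CLI flag was adopted instead
backfill_dagrun_order_reverse = True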

- Thank you very much for the update AdamAL, very helpful. When I've verified that this feature works, I'll mark your answer as the accepted one! – zeebonk Mar 21 '19 at 08:14
- [1.10.3b1 is now available from pip](https://pypi.org/project/apache-airflow/1.10.3b1/). According to the [release notes](https://lists.apache.org/thread.html/9a1ee332b55570f2d34a9564a06793da61e4f17589d47138f90cf7c1@%3Cdev.airflow.apache.org%3E), it is now possible to run backfills in reverse. It is still a beta release, so be sure to check it out and report issues. 🎆 – AdamAL Mar 25 '19 at 19:22
- This feature (the config option) didn't get merged, so it isn't available as of Airflow 2.0.1. Instead, a CLI-only solution (as described in the answer by @shankshera) was adopted: https://github.com/apache/airflow/pull/4533#issuecomment-463510714 – Mike Lutz Mar 10 '21 at 13:44
I don't think this is possible with the standard Airflow components.
Depending on the number of tasks, you could set all tasks to the success state. After the current run has completed, just clear the state and the daily import will run through.
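A rough sketch of this with the 1.10-era CLI (the DAG id and dates are placeholders): --mark_success creates the backfill runs as already-succeeded without executing anything, so the regular daily runs stay current, and clearing them later makes the scheduler actually process them.

# create the backfill runs in a succeeded state without running them
airflow backfill -s 2018-06-20 -e 2018-07-19 --mark_success my_daily_dag
# later, clear those runs so they actually execute
airflow clear -s 2018-06-20 -e 2018-07-19 --no_confirm my_daily_dag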

- I'm hoping for a hands-off solution: clear all runs of the last week and instruct Airflow to execute any open runs from most recent to oldest. Is this not possible? – zeebonk Jul 20 '18 at 08:09
Here is the backfill reference from the Airflow CLI docs (the linked version is 2.2.2):
https://airflow.apache.org/docs/apache-airflow/2.2.2/cli-and-env-variables-ref.html#backfill
With the Airflow 2 CLI this works:
airflow dags backfill --run-backwards --start-date "yyyy-mm-dd" --end-date "yyyy-mm-dd" dag_id
or
airflow dags backfill -B --start-date "yyyy-mm-dd" --end-date "yyyy-mm-dd" dag_id
(In Airflow 1.10.x the equivalent legacy command is airflow backfill --run_backwards -s "yyyy-mm-dd" -e "yyyy-mm-dd" dag_id.)

Airflow determines the date for the next scheduled DAG run based on the most recent DAG run for that DAG.
A solution, albeit a messy one, is to manually create a DAG run for today (making sure you match the DAG id exactly and use the same run id format as the scheduler does). This will force Airflow to skip the DAG runs that would otherwise happen up until this new DAG run's execution date.
You can then duplicate the DAG itself, rename it, and set start and end dates. The start date should be when the backfill should start, and the end date a date/time before the execution date you set for the manual DAG run (a second before is fine).
This will let your main DAG stay up to date while backfilling the data. However, doing this will leave your DAG history in two places. If you really care, you can probably write some SQL to merge it. It may not work for every use case, depending on how your DAGs are set up, but it could be a solution for you.
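A rough sketch of this with the 1.10-era CLI (the DAG ids, dates and exact run id format are assumptions; the scheduler's run ids typically look like scheduled__<execution date>):

# manually create "today's" DAG run so the scheduler skips everything before it
airflow trigger_dag -e 2018-07-20T00:00:00 -r scheduled__2018-07-20T00:00:00 my_daily_dag
# backfill the duplicated, renamed copy of the DAG over the historical range
airflow backfill -s 2018-06-20 -e 2018-07-19 my_daily_dag_backfill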

- I have a hard time understanding your first sentence, could you elaborate? – zeebonk Jul 20 '18 at 08:03
- Sorry, I had a typo that didn't help. Basically, for each DAG the scheduler is going to determine the next DagRun based on the time of the last one. If, for instance, your DAG is scheduled to run every hour and your latest DagRun was `2018-07-01 8:00:00`, the scheduler will find this latest run, add an hour to it, and create the new DagRun for `2018-07-01 9:00:00`. So if you were to manually create a DagRun for `2018-07-20 00:00:00`, it should schedule the next DagRun to be `2018-07-20 01:00:00`, skipping all DagRuns between the latest run and the second-latest run. – cwurtz Jul 20 '18 at 10:24
The short answer to your question is no, this isn't a supported Airflow feature today. Several of us have had a similar desire for this feature under similar circumstances after a DAG gets majorly backlogged, so it may be worth adding a ticket for it on the Airflow Jira or starting a thread on the Airflow mailing list to gather more input. (After all, maybe this is a common enough scenario that we should consider officially supporting it.)
One hack you can do in the meantime is to let all of the backfill runs get created, marking each one as failed manually or programmatically depending on how many you have. Then re-run the failed DAG runs newest-first instead of the usual oldest-first (a rough sketch follows below). This isn't as easy as a built-in feature, but I've used it as a workaround under similar circumstances.
One hack to get the backfill DAG runs to fail automatically is to add a line that raises an exception as the first line of the first task in the DAG, then remove that line after all of the backfill DAG runs have been created.
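A rough sketch of the re-run part, assuming the backfill runs have already been created and failed (the DAG id and dates are placeholders; each cleared day is picked up again by the scheduler, and you would wait for it to finish before clearing the next, older one):

# clear (and thereby re-run) the failed backfill days from newest to oldest
for day in 2018-07-19 2018-07-18 2018-07-17; do
    airflow clear --no_confirm -s "$day" -e "$day" my_daily_dag
    # wait for the cleared run to complete before moving on to the next (older) day
done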
