
I have scheduled the execution of a DAG to run daily. It works perfectly for one day.

However, each day I would like to re-execute it not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).

For example, in the next execution scheduled to run on "2018-01-30" I would like Airflow not only to run the DAG using as execution date "2018-01-30", but also to re-run the DAGs for all the previous days from "2018-01-23" to "2018-01-30".

Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?

lucacerone

2 Answers


You can generate tasks dynamically in a loop and pass the offset to your operator.

Here is an example with the PythonOperator.

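import airflow
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG

from datetime import timedelta


args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

# schedule_interval is an argument of the DAG itself, not a default task argument
dag = DAG(
    dag_id='rerun_previous_days',  # hypothetical name, pick your own
    default_args=args,
    schedule_interval='0 10 * * *',
)

def check_trigger(execution_date, day_offset, **kwargs):
    # execution_date comes from the task context, day_offset from op_kwargs
    target_date = execution_date - timedelta(days=day_offset)
    # use target_date

# one task per offset: 1 to 7 days before the execution date
for day_offset in range(1, 8):
    PythonOperator(
        task_id='task_offset_' + str(day_offset),
        python_callable=check_trigger,
        provide_context=True,
        dag=dag,
        op_kwargs={'day_offset': day_offset}
    )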
Antoine Augusti
  • Thanks Antoine, I ended up using an approach very similar to what you explain where several branches of the DAG are created with a for loop. What I don't like about it is that during a backfill, for each day of execution all the branches are executed, even if data has already consolidated and it is enough to process one day. I ended up using a Variable called backfill to decide whether the range should be from 1 to 8 (when not backfilling) or just 1 (when backfilling). Although I'd prefer a solution where the offset is decided based on the execution date rather than a Variable. Any idea? – lucacerone Jan 31 '18 at 05:47
  • You can add a function which decides if you need to run the backfill or not depending on the execution date. Maybe use a ShortCircuitOperator in front of your actual operator to avoid running it in specific cases? – Antoine Augusti Jan 31 '18 at 09:37
  • thanks! is execution_date always used as the first argument when you use the PythonOperator? – lucacerone Feb 06 '18 at 06:00
  • but is there a way to use `target_date` via Jinja template? i.e. to run a BQ operator with it – Maxim Volgin Jun 27 '22 at 11:18
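The two follow-up ideas in the comments above (deciding per execution date whether an offset needs to run, and getting the target date through a Jinja template) could be sketched roughly as follows. The helper names `should_run_offset` and `templated_target_date` and the `max_lag` threshold are assumptions made for illustration, not part of Airflow; only the `macros.ds_add` template macro and the `ds` variable are Airflow's own.

```python
from datetime import datetime, timedelta


def should_run_offset(execution_date, day_offset, now, max_lag=timedelta(days=2)):
    """Return True if the task for this offset should run.

    Assumption: a run whose execution_date lags `now` by more than
    `max_lag` is a backfill over already consolidated data, so only the
    smallest offset (1) still needs processing.
    """
    is_backfill = (now - execution_date) > max_lag
    return day_offset == 1 or not is_backfill


def templated_target_date(day_offset):
    """Build a Jinja expression that Airflow renders as `ds` minus day_offset days."""
    return "{{ macros.ds_add(ds, -%d) }}" % day_offset


if __name__ == "__main__":
    now = datetime(2018, 1, 31, 10, 0)
    # Recent run: every offset is processed.
    print(should_run_offset(datetime(2018, 1, 30), 5, now))
    # Old run (backfill): offsets other than 1 are skipped.
    print(should_run_offset(datetime(2018, 1, 1), 5, now))
    print(templated_target_date(3))
```

A function like `should_run_offset` could serve as the `python_callable` of a ShortCircuitOperator placed in front of each offset task, provided `now` and `execution_date` are both naive or both timezone-aware. The string returned by `templated_target_date` is only rendered when it is passed to a templated field of an operator, for example a query string.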

Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.

I think that will make debugging easier and history cleaner. I believe trying to backfill already executed tasks will involve deleting task instances or setting their states all to NONE. Then you’ll still have to trigger a backfill on those dag runs. It’ll be harder to track when things fail and just seems a bit messier.

Daniel Huang
  • "Have you considered having the dag that runs once a day just run your task for the last 7 days?" That's exactly what I'd like to do, maybe "backfill" wasn't the right word. Could you elaborate a bit more your answer? – lucacerone Jan 29 '18 at 19:59
  • Antoine beat me to it. See his answer for how you run a task for each day in a single dag run. My suggestion of using a SubDAG is only if your job is pretty complex, otherwise you can just put your one or more operators in the for loop. – Daniel Huang Jan 31 '18 at 00:23