
I am working with Apache Airflow and I have a problem with the scheduled day and the starting day.

I want a DAG to run every day at 8:00 AM UTC. So, I did:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 7, 10, 0, 0),
    'email': ['example@emaiil.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='id', default_args=default_args, schedule_interval='0 8 * * *', catchup=True)

I uploaded the DAG on 2020-12-07 and wanted its first run to be on 2020-12-08 at 08:00:00.

I set the start_date to 2020-12-07 at 10:00:00 so that it would skip 2020-12-07 at 08:00:00 and first trigger the next day, but it didn't work.

Then I modified the starting day:

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 7, 7, 59, 0),
    'email': ['example@emaiil.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='etl-ca-cpke-spark_dev_databricks', default_args=default_args, schedule_interval='0 8 * * *', catchup=True)

Now the start date is 1 minute before the DAG should run and, because catchup is set to True, the DAG has indeed been triggered for 2020-12-07 at 08:00:00. However, it has not been triggered for 2020-12-08 at 08:00:00.

Why?

Peter Mortensen
J.C Guzman

1 Answer


Airflow triggers a run at the end of its schedule interval, not at the start (see the documentation reference).

Meaning that when you do:

start_date = datetime(2020, 12, 7, 8, 0, 0)
schedule_interval = '0 8 * * *'

The first run will kick in on 2020-12-08 at around 08:00 (the exact time depends on available resources).

This run's execution_date will be 2020-12-07 08:00.

The next run will kick in on 2020-12-09 at 08:00.

That run's execution_date will be 2020-12-08 08:00.

Since today is 2020-12-08, the run with execution_date 2020-12-08 08:00 hasn't kicked in yet because its interval doesn't end until 2020-12-09 08:00.
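The timing above can be sketched in plain Python, without importing Airflow. This is only an illustration of the "run at the end of the interval" rule for a daily `0 8 * * *` schedule; the helper names `first_daily_slot` and `planned_runs` are made up for this example and are not part of Airflow's API.

```python
from datetime import datetime, timedelta


def first_daily_slot(start_date: datetime, hour: int = 8) -> datetime:
    """Return the first daily HH:00 schedule point at or after start_date."""
    slot = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if slot < start_date:
        # Today's slot is already behind start_date, so use tomorrow's.
        slot += timedelta(days=1)
    return slot


def planned_runs(start_date: datetime, count: int = 2, hour: int = 8):
    """Return (execution_date, trigger_time) pairs for a daily schedule.

    Each run is triggered when its interval ends, i.e. one day after
    its execution_date.
    """
    interval = timedelta(days=1)
    execution_date = first_daily_slot(start_date, hour)
    runs = []
    for _ in range(count):
        runs.append((execution_date, execution_date + interval))
        execution_date += interval
    return runs


# The three start dates discussed in the question and the answer.
for start in (datetime(2020, 12, 7, 8, 0, 0),
              datetime(2020, 12, 7, 10, 0, 0),
              datetime(2020, 12, 7, 7, 59, 0)):
    for execution_date, trigger_time in planned_runs(start):
        print(f"start_date={start}  execution_date={execution_date}  "
              f"triggered at {trigger_time}")
```

Under this model, a start_date of 2020-12-07 10:00 misses that day's 08:00 slot, so the first interval only begins on 2020-12-08 08:00 and its run is not triggered before 2020-12-09 08:00. A start_date of 07:59 or 08:00 on 2020-12-07 gives a first run on 2020-12-08 at 08:00 with execution_date 2020-12-07 08:00.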

Peter Mortensen
Elad Kalif
  • So if I want the first run on `2020-12-08` at `08:00` and then a run every day at `08:00`, should I set `start_date` to `2020-12-06 08:00`? I don't understand why, with a start date of `2020-12-07 08:00`, it doesn't run on both `2020-12-07 08:00` and `2020-12-08 08:00`. – J.C Guzman Dec 08 '20 at 10:39
  • You are interpreting `start_date` like a cron job, where you specify a time and the job starts running on that date. This is not how Airflow works. Airflow takes `start_date` plus the interval, and when that period **ends** it starts the run. The logic behind this is that an ETL job should run at the end of the interval, over the whole window: today you want to process yesterday's data. – Elad Kalif Dec 08 '20 at 10:45
  • Then if I want to modify my DAG so that it is scheduled every day at 08:00:00 from tomorrow on, what start date should I set? – J.C Guzman Dec 08 '20 at 10:59
  • Go with the same logic. The run kicks in at the end of the interval, so with `start_date = datetime(2020, 12, 8, 8, 0, 0)` and a schedule of `0 8 * * *`, the first interval ends at `2020-12-09 08:00`, and that is when the first run will kick in. Note that this run will have an execution_date of `2020-12-08 08:00`. – Elad Kalif Dec 08 '20 at 11:14
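The rule from the comments can be turned around to pick a start date: subtract one schedule interval from the moment you want the first run to be triggered. A minimal sketch, assuming a fixed daily interval and a desired trigger time that already falls on the schedule (here 08:00); `start_date_for_first_trigger` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta


def start_date_for_first_trigger(first_trigger: datetime,
                                 interval: timedelta = timedelta(days=1)) -> datetime:
    """Return the start_date whose first interval ends at first_trigger.

    Assumes first_trigger is aligned with the schedule, so the result is
    also the execution_date of that first run.
    """
    return first_trigger - interval


# First run triggered on 2020-12-09 at 08:00 ("from tomorrow on" in the comments).
print(start_date_for_first_trigger(datetime(2020, 12, 9, 8, 0, 0)))

# First run triggered on 2020-12-08 at 08:00.
print(start_date_for_first_trigger(datetime(2020, 12, 8, 8, 0, 0)))
```

For a first run on 2020-12-08 at 08:00, this gives a start date of 2020-12-07 08:00 rather than 2020-12-06 08:00. With catchup=True, the earlier date would also schedule a run for the 2020-12-06 interval.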