
Problem: The start date of my DAG is not being set properly. Can anyone tell me why? Here is sample code:

import logging
from datetime import date, datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "hello",
    "email_on_failure": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
    "start_date": datetime(2022, 7, 20),
    "catchup": True,
    "schedule_interval": "@weekly",
}


def dummy_function():
    # just some test function, ignore
    file_name = str(datetime.today()) + "_dummy.csv"
    with open(file_name, "w") as f:
        pass


def trigger_extractor_lambda(ds, **kwargs):

    logging.info(ds)
    logging.info(date.fromisoformat(ds))
    # further code ...

with DAG("ufc-main-dag", default_args=default_args) as dag:
    dummy_task = PythonOperator(
        task_id="dummy_task", python_callable=dummy_function, dag=dag
    )
    # lambda pulls raw data into S3
    extractor_task = PythonOperator(
        task_id="extractor_task",
        python_callable=trigger_extractor_lambda,
        provide_context=True,
        dag=dag,
    )

dummy_task >> extractor_task

The logging of ds shows the current date, yet I explicitly set the start date to be in July. What am I missing? I am using MWAA, for what it's worth. Thanks in advance.

user19138502

3 Answers


The start_date parameter does not specify the date of a DAG's first run. Instead, it marks the start of the DAG's first data interval; together with schedule_interval (and optionally end_date), it determines when runs are actually scheduled.

From Airflow documentation:

Similarly, since the start_date argument for the DAG and its tasks points to the same logical date, it marks the start of the DAG’s first data interval, not when tasks in the DAG will start running. In other words, a DAG run will only be scheduled one interval after start_date.

Reference: Data Interval (Airflow)

Since you have catchup=True and schedule_interval=@weekly, you'll have to set start_date=2022-07-13 (one week before 2022-07-20) if you want your DAG to begin running from 2022-07-20. With this configuration, the Airflow scheduler will schedule DAG runs from 2022-07-20 through the latest completed data interval.

Reference: Catchup (Airflow)
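
A minimal sketch of that adjustment, keeping the dag_id from the question and passing the scheduling arguments to the DAG constructor rather than default_args (as also suggested in the comments below):

from datetime import datetime

from airflow import DAG

with DAG(
    "ufc-main-dag",
    # one interval (one week) before the first run you want:
    # the 2022-07-13..2022-07-20 interval is scheduled once 2022-07-20 has passed
    start_date=datetime(2022, 7, 13),
    schedule_interval="@weekly",
    catchup=True,
) as dag:
    ...  # define dummy_task and extractor_task here as in the question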

Andrew Nguonly
  • Sorry, my previous comment had a typo: shouldn't there be a sequence of triggerings of DAGs, starting with July 27th and going up to today in 7-day intervals? That is what I am trying to achieve. – user19138502 Sep 06 '22 at 15:33
  • @user19138502, have you tried setting `start_date`, `schedule_interval`, and `catchup` parameters in the `DAG` constructor instead of the `default_args` parameter? Example: `DAG("ufc-main-dag", start_date=datetime(2022, 7, 20), schedule_interval="@weekly", catchup=True, default_args=default_args)` – Andrew Nguonly Sep 07 '22 at 02:51
  • yeah, that doesn't do anything :/ – user19138502 Sep 07 '22 at 16:27

It's because you're putting DAG-level settings in default_args. You need to set start_date and schedule in the DAG definition itself.

default_args is a set of args that gets passed to each Airflow operator, NOT the DAG itself. See the base definition here. You can also see the example in their base tutorial, where the common DAG-level vars are set in the DAG definition.

I'm not sure exactly how Airflow behaves if you pass it args the way you have in your sample code, but the correct way to do it would be:

import pendulum

from airflow import DAG

default_args = {
    "retries": 0,  # applied to each Airflow task
    # any other task-level args to pass
}

with DAG(
    "ufc-main-dag",
    start_date=pendulum.datetime(2022, 7, 20, tz="UTC"),
    schedule="@weekly",  # or a cron string; use schedule_interval on Airflow < 2.4
    catchup=True,
    default_args=default_args,
) as dag:
    # DAG things here
    # All your tasks will inherit default_args unless you explicitly override them
    ...
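
For instance, a task declared inside that context manager picks up retries=0 automatically (a sketch reusing dummy_function from the question):

from airflow.operators.python import PythonOperator

dummy_task = PythonOperator(
    task_id="dummy_task",
    python_callable=dummy_function,
    # no retries argument here, so the task inherits retries=0 from default_args
)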
glob
  • As a side note, with me not being motivated enough to google around for why this is: for some reason, every Airflow usage I've seen uses `pendulum` for setting the datetime for the DAG start_date instead of just straight-up datetime.datetime. No idea why that is, but I trust my coworkers and would say use pendulum. – glob Sep 08 '22 at 04:09
  • Hey :) I'm not sure it makes a difference. I tried it your way and it didn't change. Also, I think it's "schedule_interval", not "schedule", in the DAG() instantiation args? – user19138502 Sep 08 '22 at 11:41
  • [see here](https://stackoverflow.com/questions/62477705/airflow-not-picking-up-start-date-from-dag) – user19138502 Sep 08 '22 at 11:51

The solution was to first run a "backfill" rather than rely on catchup, as there were previous DAG runs that prevented Airflow from seeing that there were more missed runs to schedule.
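
For reference, a backfill like that can be launched from the Airflow CLI (Airflow 2.x syntax; the date range below is illustrative, matching the weekly schedule from the question):

airflow dags backfill \
    --start-date 2022-07-13 \
    --end-date 2022-09-06 \
    ufc-main-dag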

user19138502