I have a DAG. Here is a sample of the parameters.

from datetime import datetime

from airflow import DAG

dag = DAG(
    'My Dag',
    default_args=default_args,  # default_args is not shown in this sample
    description='Cron Job : My Dag',
    schedule_interval='45 07 * * *',
    # start_date=days_ago(0),
    start_date=datetime(2021, 4, 6, 10, 45),
    tags=['My Dag Tag'],
    concurrency=1,
    is_paused_upon_creation=True,
    catchup=False,  # Don't run previous and backfill; run only latest
)

From the Apache Airflow documentation, I understand that this schedules the DAG to run at 7:45 every day. However, if I pause the DAG and unpause it a couple of days later, it runs as soon as I unpause it. It runs only once, for that day, because catchup=False prevents backfills.

That is not the expected behaviour, right?

I mean, I scheduled it for 7:45. When I unpause it at 10:00, it should not run at all until the next 7:45.

What am I missing here?

Peter Mortensen
raaj
  • This is expected. It's a bit difficult to explain with the example you've shown because your example isn't real. You provided a start_date of yesterday and you claim the DAG has been paused for a few days. This doesn't make sense. Please add a real DAG example with information about which runs were executed and the run that you have an issue with, and I will be able to explain it to you with your own example. – Elad Kalif Apr 19 '21 at 11:18
  • This example is very much real. The start date provided is the 6th of April 2021, which is not yesterday but 13 days ago. I have only changed the name of the DAG because the information is confidential. Since this is a real example, please explain it if you can. – raaj Apr 19 '21 at 11:20
  • Oh sorry, I misread. Please add the execution_dates of the runs that were created and when exactly you paused and unpaused. – Elad Kalif Apr 19 '21 at 11:23
  • The last time it ran was the 7th of April, 2021, and I have had it paused since then. When I unpaused it today, it ran once, as soon as I unpaused it. Is this behaviour expected? For a cron-like DAG, we would generally want it to run on the next schedule after unpausing. – raaj Apr 19 '21 at 11:28
  • @raaj - did you come up with a workaround? I have the exact same problem. I don't want anything to run "right now". I want crontab-style behavior. – user3240688 Oct 13 '21 at 17:02
  • @user3240688 Not yet, mate. I unpause it at the time I want it to run, so the first run appears to happen at the expected time. – raaj Oct 15 '21 at 08:14
  • @raaj - what do you mean by "unpause it at the time i want it running"? Say you have a job at 7:45 every day, and you paused it. When it's time to unpause, you wait until 7:45 the next day to unpause? – user3240688 Oct 15 '21 at 13:10
  • @user3240688 If you want to schedule your job at 7:45 every day, why would you pause it? If you pause it for a few days and want to unpause it, set catchup to False so that it does not run the previously scheduled runs. I handle the first-run problem by keeping the jobs paused on creation by default, unpausing the job (expecting it to run once), and then leaving it on schedule. – raaj Oct 25 '21 at 10:44
  • @raaj, I am in the same situation: Airflow always kicks off a DAG run immediately, so I see two runs happening back to back (one manual and one scheduled). So yes, I want exactly the same behaviour: Airflow should always wait for the next scheduled time to run. – thedevd Feb 21 '23 at 08:38

1 Answer

I assume that you are familiar with the scheduling mechanism of Airflow. If this is not the case, please read "Problem with start date and scheduled date in Apache Airflow" before reading the rest of the answer.

As for your case:

You had one or several runs, as expected, when you deployed the DAG. You paused the DAG on 2021-04-07 and unpaused it today (2021-04-19). Airflow then executed a DAG run with execution_date='2021-04-18'.

This is expected.

The reason lies in the scheduling mechanism of Airflow.

Your last run was on 2021-04-07 and the interval is 45 07 * * * (every day at 07:45). Since you paused the DAG, the runs of 2021-04-08, 2021-04-09, ... , 2021-04-17 were never created. When you unpaused the DAG, Airflow didn't create these runs because of catchup=False. However, the run that started today (2021-04-19) isn't part of the catchup. Airflow starts a run once its interval has ended, and the interval of execution_date=2021-04-18 (from 2021-04-18 07:45 to 2021-04-19 07:45) had already ended when you unpaused the DAG, so that run was scheduled and started immediately.
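To make the interval arithmetic concrete, here is a minimal sketch in plain Python. It does not use Airflow at all: `latest_completed_interval` is a hypothetical helper, the unpause time of 11:20 is an assumption taken from the comment timestamps, and timezones are ignored. It only mimics the rule that, with catchup=False, the scheduler picks the most recent interval that has already ended.

```python
from datetime import datetime, time, timedelta


def latest_completed_interval(now, run_at=time(7, 45)):
    """Return (start, end) of the most recent daily interval that has fully elapsed.

    Hypothetical helper for illustration only; it models a daily cron
    such as '45 07 * * *' and ignores timezones.
    """
    # Candidate interval end: today's scheduled time.
    end = datetime.combine(now.date(), run_at)
    # If today's scheduled time has not passed yet, the last completed
    # interval ended at yesterday's scheduled time instead.
    if end > now:
        end -= timedelta(days=1)
    # The interval start is what Airflow calls the execution_date.
    return end - timedelta(days=1), end


unpaused_at = datetime(2021, 4, 19, 11, 20)  # assumed unpause time
start, end = latest_completed_interval(unpaused_at)
print(start, end)  # 2021-04-18 07:45:00 2021-04-19 07:45:00
```

The interval that started on 2021-04-18 had already ended by the time of the unpause, which is why a run with that execution_date starts right away instead of waiting for the next 07:45.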

The behavior that you are experiencing is no different from deploying this fresh DAG:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 1, 1),
}

with DAG(dag_id='stackoverflow_question',
         default_args=default_args,
         schedule_interval='45 07 * * *',
         catchup=False
         ) as dag:
    DummyOperator(task_id='some_task')

As soon as you deploy it, a single run is created.

The DAG's start_date is 2020-01-01 with catchup=False. I deployed the DAG today (2021-04-19), so Airflow created a run with execution_date='2021-04-18' that started running today.

Peter Mortensen
Elad Kalif
  • Thank you for the explanation. I really appreciate it. However, I am in a situation where I would rather have my DAGs run at around the same time every day or not at all, and this behaviour is counterintuitive for that. Ideally, one would expect something to run at the scheduled time. Is there a way to achieve this in Airflow, that is, to skip this run when unpausing a DAG and run only from the next one onwards? – raaj Apr 19 '21 at 12:14
  • @raaj If you want your unpaused DAG to start with the run for April 19, then unpause it on April 20. – SergiyKolesnikov Apr 19 '21 at 12:54
  • @raaj This was an explanation of Airflow's behavior. You are now asking how to work around it. Again, this is a matter of understanding the scheduling mechanism. If you want to process the data of 2021-04-18, then unpause the DAG on 2021-04-19. – Elad Kalif Apr 19 '21 at 13:05
  • So essentially, to maintain the ETL logic, Airflow is set up not to follow the specified cron strictly, which is a major deciding characteristic of Airflow. For example, if I want a pipeline to run a process at a specified time of day or not at all (at any other time), Airflow does not seem to be an adequate tool, judging by this behaviour, because it will immediately run at least one instance of the previous backlog, over which I have no control. – raaj Apr 20 '21 at 09:15
  • @raaj This is how Airflow works. It is going to change in the future: AIP-39 (Richer scheduler_interval) is already accepted and in progress. https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval – Elad Kalif Apr 20 '21 at 09:17
  • @raaj If the issue is solved, consider accepting the answer. – Elad Kalif Apr 22 '21 at 15:18
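
For readers who want the "run on time or not at all" behaviour requested in the comments, one possible workaround is to let the late run start and have its first task decide whether to skip the rest. This is only a sketch of the decision logic, not something confirmed in this thread: `is_on_time` and the 15-minute tolerance are hypothetical. In a real DAG, a function like this could be called from a ShortCircuitOperator (from airflow.operators.python), with the scheduled time taken from the task context, for example the value that Airflow 2.0 exposes as next_execution_date.

```python
from datetime import datetime, timedelta


def is_on_time(scheduled_for, now, tolerance=timedelta(minutes=15)):
    """Return True when `now` is at most `tolerance` after the scheduled time.

    Hypothetical helper: `scheduled_for` is the moment the run was meant to
    start, i.e. the end of its schedule interval.
    """
    lateness = now - scheduled_for
    return timedelta(0) <= lateness <= tolerance


scheduled_for = datetime(2021, 4, 19, 7, 45)

# A run that starts shortly after 07:45 proceeds.
print(is_on_time(scheduled_for, datetime(2021, 4, 19, 7, 50)))

# A run triggered by unpausing at 10:00 is treated as too late.
print(is_on_time(scheduled_for, datetime(2021, 4, 19, 10, 0)))
```

With this check, the run created on unpausing still appears in the UI, but its downstream tasks would be skipped, and the next run at 07:45 would proceed normally.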