
tl;dr, Problem framing:

Assume I have a sensor poking with timeout = 24*60*60. Since the connection occasionally times out, retries must be allowed. If the sensor now retries, the timeout is applied to every new try with the full 24*60*60 again, and therefore the task does not time out after 24 hrs as intended.

Question:

Is there a way to restrict the maximum overall runtime of a task, like a meta-timeout?

Airflow-Version: 1.10.14

Walk-through:

import os
from datetime import timedelta

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor  # Airflow 1.10.x import path
# (import of the custom InitCleanProcFolderOperator omitted)

BASE_DIR = "/some/base/dir/"
FILE_NAME = "some_file.xlsx"
VOL_BASE_DIR = "/some/mounted/vol/"

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": "2020-11-01",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "supplier",
    default_args=default_args,
    description="ETL Process for Supplier",
    schedule_interval=None,
    catchup=False,
    max_active_runs=1,
)

file_sensor = FileSensor(
    task_id="file_sensor",
    poke_interval=60*60,
    timeout=24*60*60,
    retries=4,
    mode="reschedule",
    filepath=os.path.join(BASE_DIR, FILE_NAME),
    fs_conn_id='conn_filesensor',
    dag=dag,
)

clean_docker_vol = InitCleanProcFolderOperator(
    task_id="clean_docker_vol",
    folder=VOL_BASE_DIR,
    dag=dag,
)

....

This DAG should run and check whether a file exists. If it exists, it should continue. Occasionally the sensor task is rescheduled because the file is provided too late (or, say, because of connection errors). The maximum overall runtime of the DAG should NOT exceed 24 hrs. Because of the retries, however, the runtime does exceed the 24 hr timeout whenever the task fails and is rescheduled.

Example:

  1. runs for 4 hrs (20 hrs of the budget should be left)
  2. fails
  3. up_for_retry
  4. starts again with the full 24 hr timeout, not the remaining 20 hrs (see the sketch of the worst case below)
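
A rough sketch of the worst case this produces (values taken from the sensor above; retry delays ignored for simplicity):

# each retry restarts the timeout, so the effective limit multiplies
timeout = 24 * 60 * 60                 # 24 h budget per try
retries = 4                            # as set on the sensor above
worst_case = timeout * (retries + 1)
print(worst_case / 3600)               # 120 h in total, not the intended 24 h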

As I need to allow retries, simply setting retries to 0 to avoid this behaviour is not an option. I am rather looking for a meta-timeout variable in Airflow, a hint on how this could be implemented within the related classes, or any other workaround.

many thanks.

Bennimi

4 Answers


Note that the timeout computation with retries changed in version 2.2.0 (released 2021-10-11).

From the release notes:

"If a sensor times out, it will not retry

Previously, a sensor is retried when it times out until the number of retries are exhausted. So the effective timeout of a sensor is timeout * (retries + 1). This behaviour is now changed. A sensor will immediately fail without retrying if timeout is reached. If it’s desirable to let the sensor continue running for longer time, set a larger timeout instead."

https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#if-a-sensor-times-out-it-will-not-retry
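
For illustration, with 2.2.0+ semantics the timeout alone bounds the sensor's total runtime, so the setup from the question would roughly become (Airflow 2.x import path; parameter values taken from the question):

    from airflow.sensors.filesystem import FileSensor  # Airflow 2.x import path

    file_sensor = FileSensor(
        task_id="file_sensor",
        filepath=os.path.join(BASE_DIR, FILE_NAME),
        fs_conn_id="conn_filesensor",
        poke_interval=60 * 60,
        timeout=24 * 60 * 60,   # reaching this now fails the task immediately, retries or not
        retries=4,              # retries still cover genuine errors such as lost connections
        mode="reschedule",
        dag=dag,
    )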

Gonzalo Odiard

You can use the poke_interval parameter to configure the poking frequency within the predefined timeout. Something like this: MySensor(..., retries=0, timeout=24*60*60, poke_interval=60*60). In this example the sensor pokes every hour, and if it does not succeed within a day, it fails.
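
Applied to the FileSensor from the question (Airflow 1.10.x import path), that would look roughly like:

    import os

    from airflow.contrib.sensors.file_sensor import FileSensor

    file_sensor = FileSensor(
        task_id="file_sensor",
        filepath=os.path.join(BASE_DIR, FILE_NAME),
        fs_conn_id="conn_filesensor",
        retries=0,              # no retries: the timeout is the hard limit
        timeout=24 * 60 * 60,   # fail if the file has not appeared within a day
        poke_interval=60 * 60,  # poke every hour
        mode="reschedule",
        dag=dag,
    )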

SergiyKolesnikov
  • Yeah, but this does not solve my problem: once the connection gets lost, the task will fail (which should not happen, therefore retries should not be ZERO) – Bennimi Apr 27 '21 at 12:24
  • It is unclear from the question what connection and what task you mean. Maybe a more complete description with a minimal and reproducible code example will make it more clear. https://stackoverflow.com/help/how-to-ask – SergiyKolesnikov Apr 27 '21 at 13:50

I implemented a rather hacky solution that nevertheless works for me.

  • Added a new function to the sensor class:

    # module-level imports needed by the function
    import datetime
    import logging

    from airflow.exceptions import AirflowSkipException
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.utils import timezone

    def _apply_meta_timeout(self, context):

        if not self.meta_task_timeout:
            return None
        elif self.retries == 0:
            raise ValueError("'meta_task_timeout' cannot be applied if 'retries' is set to 0. Use 'timeout' instead.")

        # normalize meta_task_timeout to seconds
        if isinstance(self.meta_task_timeout, datetime.timedelta):
            self.meta_task_timeout = self.meta_task_timeout.total_seconds()
        if not isinstance(self.meta_task_timeout, (int, float)):
            raise ValueError("Cannot convert 'meta_task_timeout' to type(int) or type(float).")

        if self.meta_task_timeout < self.timeout:
            raise ValueError("'meta_task_timeout' cannot be less than the 'timeout' variable.")

        logging.info(f"Get current dagrun params: {context['ti'].task_id}, {context['ti'].dag_id}, {context['ti'].execution_date}, {context['ti'].try_number}")
        pg_hook = PostgresHook(postgres_conn_id="airflow-metadata-db")
        pg_conn = pg_hook.get_conn()
        pg_cur = pg_conn.cursor()
        if context['ti'].try_number != 1:
            # on a retry, look up the start time of the very first try in the metadata DB
            try:
                query = f"""
                select start_date from task_fail
                    where task_id='{context['ti'].task_id}'
                    and dag_id='{context['ti'].dag_id}'
                    and execution_date ='{context['ti'].execution_date}'
                    order by start_date asc
                    LIMIT 1;"""
                pg_cur.execute(query)
                init_start_timestamp = pg_cur.fetchone()[0]
            except Exception as e:
                raise ConnectionError("Connection failed with error: " + str(e))
            finally:
                pg_cur.close()
                pg_conn.close()
        else:
            # first try: the task instance's own start_date is the initial start
            init_start_timestamp = context['ti'].start_date

        logging.info(f"Initial dag startup: {init_start_timestamp}")

        if (timezone.utcnow() - init_start_timestamp).total_seconds() > self.meta_task_timeout:
            if self.soft_fail:
                self._do_skip_downstream_tasks(context)
            raise AirflowSkipException('Snap. Maximal task runtime is UP.')

        logging.info(f"Time left until 'meta_task_timeout' applies: {self.meta_task_timeout - (timezone.utcnow() - init_start_timestamp).total_seconds()} second(s).")
  • Overrode/added to the poke function (a fuller wiring sketch follows after this list):

    def poke(self, context):
        ...
        # check for the meta-timeout on every poke
        self._apply_meta_timeout(context)
  • Added a connection to the Airflow metadata database with the id: airflow-metadata-db

  • Called the sensor operator with the additional params:

    dummy_sensor = FileSensor(
        task_id="file_sensor",
        remote_path=os.path.join(REMOTE_INPUT_PATH, REMOTE_INPUT_FILE),
        do_xcom_push=False,
        timeout=60,
        retries=2,
        mode="reschedule",
        meta_task_timeout=5*60,
        soft_fail=True,
        #context=True,
    )
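
A minimal sketch of how the pieces above could fit together in a sensor subclass (the class name and exact structure are my own, not part of the original solution; _apply_meta_timeout is the function shown above):

    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.utils.decorators import apply_defaults


    class MetaTimeoutFileSensor(FileSensor):
        """FileSensor with an overall runtime limit that spans retries."""

        @apply_defaults
        def __init__(self, meta_task_timeout=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.meta_task_timeout = meta_task_timeout

        def poke(self, context):
            # enforce the overall limit on every poke, then run the normal check
            self._apply_meta_timeout(context)
            return super().poke(context)

        # _apply_meta_timeout(self, context) as defined above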

The main reason this workaround is needed is that Airflow seems to overwrite the recorded start_date for each individual try, so the start time of the very first try has to be recovered from the task_fail metadata table.

Please feel free to add any suggestions for improvement. Thanks

Bennimi

This is why we use retries and retry_delay for sensors instead of poke_interval and timeout. Retries achieve exactly what you want to do. In your task definition, use

retries=24,
retry_delay=timedelta(hours=1),

instead of

poke_interval=60*60,
timeout=24*60*60,
retries=4,

where, by the way, you should add mode="reschedule", so that your sensor doesn't take up a worker slot for its whole execution time (here, your task would occupy a whole slot for 24 hours while sleeping most of the time).
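
A sketch of the full sensor definition under this approach (Airflow 1.10.x import path; the short poke_interval and timeout values are illustrative):

    import os
    from datetime import timedelta

    from airflow.contrib.sensors.file_sensor import FileSensor

    file_sensor = FileSensor(
        task_id="file_sensor",
        filepath=os.path.join(BASE_DIR, FILE_NAME),
        fs_conn_id="conn_filesensor",
        poke_interval=30,                # poke quickly within a single try
        timeout=60,                      # let each individual try fail fast ...
        retries=24,                      # ... and spread 24 further attempts ...
        retry_delay=timedelta(hours=1),  # ... one hour apart, for roughly 24 h of coverage
        mode="reschedule",               # free the worker slot between pokes and retries
        dag=dag,
    )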

The start_date of each DAG run shouldn't be overwritten by Airflow and should be available through {{ ds }} (which is the start of the data interval) or {{ data_interval_end }} (see the Airflow documentation).

Amandine