
tl;dr, Problem framing:

Assume I have a sensor poking with timeout = 24*60*60. Since the connection occasionally times out, retries must be allowed. If the sensor now retries, the timeout is applied to every new try with the full 24*60*60 again, and therefore the task does not time out after 24 hrs as intended.

Question:

Is there a way to restrict the maximum overall runtime of a task, like a meta-timeout?

Airflow-Version: 1.10.14

Walk-through:

import os
from datetime import timedelta

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor  # Airflow 1.10.x import path
# (import of the custom InitCleanProcFolderOperator omitted)

BASE_DIR = "/some/base/dir/"
FILE_NAME = "some_file.xlsx"
VOL_BASE_DIR = "/some/mounted/vol/"

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": "2020-11-01",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "supplier",
    default_args=default_args,
    description="ETL Process for Supplier",
    schedule_interval=None,
    catchup=False,
    max_active_runs=1,
)

file_sensor = FileSensor(
    task_id="file_sensor",
    poke_interval=60*60,
    timeout=24*60*60,
    retries=4,
    mode="reschedule",
    filepath=os.path.join(BASE_DIR, FILE_NAME),
    fs_conn_id='conn_filesensor',
    dag=dag,
)

clean_docker_vol = InitCleanProcFolderOperator(
    task_id="clean_docker_vol",
    folder=VOL_BASE_DIR,
    dag=dag,
)

....

This DAG should run and check whether a file exists. If it exists, it should continue. Occasionally the sensor task is rescheduled because the file is provided too late (or, say, because of connection errors). The maximum overall runtime of the DAG should NOT exceed 24 hrs. Because of the retries, however, the runtime does exceed the 24 hr timeout whenever the task fails and is rescheduled.

Example:

  1. runs for 4 hrs (20 hrs of the budget should be left)
  2. fails
  3. up_for_retry
  4. starts again with the full 24 hr timeout, not the remaining 20 hrs (see the sketch of the worst case below)
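
A rough sketch of the worst case this produces (values taken from the sensor above; retry delays ignored for simplicity):

# each retry restarts the timeout, so the effective limit multiplies
timeout = 24 * 60 * 60                 # 24 h budget per try
retries = 4                            # as set on the sensor above
worst_case = timeout * (retries + 1)
print(worst_case / 3600)               # 120 h in total, not the intended 24 h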

As I need to allow retries, simply setting retries to 0 to avoid this behaviour is not an option. I am rather looking for a meta-timeout variable in Airflow, a hint on how this could be implemented within the related classes, or any other workaround.

many thanks.

Bennimi

4 Answers


Note that the timeout computation with retries changed in version 2.2.0 (released 2021-10-11).

From the release notes:

"If a sensor times out, it will not retry

Previously, a sensor is retried when it times out until the number of retries are exhausted. So the effective timeout of a sensor is timeout * (retries + 1). This behaviour is now changed. A sensor will immediately fail without retrying if timeout is reached. If it’s desirable to let the sensor continue running for longer time, set a larger timeout instead."

https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#if-a-sensor-times-out-it-will-not-retry
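
For illustration, with 2.2.0+ semantics the timeout alone bounds the sensor's total runtime, so the setup from the question would roughly become (Airflow 2.x import path; parameter values taken from the question):

    from airflow.sensors.filesystem import FileSensor  # Airflow 2.x import path

    file_sensor = FileSensor(
        task_id="file_sensor",
        filepath=os.path.join(BASE_DIR, FILE_NAME),
        fs_conn_id="conn_filesensor",
        poke_interval=60 * 60,
        timeout=24 * 60 * 60,   # reaching this now fails the task immediately, retries or not
        retries=4,              # retries still cover genuine errors such as lost connections
        mode="reschedule",
        dag=dag,
    )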

Gonzalo Odiard

You can use the poke_interval parameter to configure the poking frequency within the predefined timeout. Something like this: MySensor(..., retries=0, timeout=24*60*60, poke_interval=60*60). In this example the sensor pokes every hour, and if it does not succeed within a day, it fails.
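
Applied to the FileSensor from the question (Airflow 1.10.x import path), that would look roughly like:

    import os

    from airflow.contrib.sensors.file_sensor import FileSensor

    file_sensor = FileSensor(
        task_id="file_sensor",
        filepath=os.path.join(BASE_DIR, FILE_NAME),
        fs_conn_id="conn_filesensor",
        retries=0,              # no retries: the timeout is the hard limit
        timeout=24 * 60 * 60,   # fail if the file has not appeared within a day
        poke_interval=60 * 60,  # poke every hour
        mode="reschedule",
        dag=dag,
    )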

SergiyKolesnikov
  • Yeah, but this does not solve my problem: once the connection gets lost, the task will fail (which should not happen, therefore retries should not be ZERO) – Bennimi Apr 27 '21 at 12:24
  • It is unclear from the question what connection and what task you mean. Maybe a more complete description with a minimal and reproducible code example will make it more clear. https://stackoverflow.com/help/how-to-ask – SergiyKolesnikov Apr 27 '21 at 13:50

I implemented a rather hacky solution that nevertheless works for me.

  • Added a new function to the sensor class:

    # module-level imports needed by the function
    import datetime
    import logging

    from airflow.exceptions import AirflowSkipException
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.utils import timezone

    def _apply_meta_timeout(self, context):

        if not self.meta_task_timeout:
            return None
        elif self.retries == 0:
            raise ValueError("'meta_task_timeout' cannot be applied if 'retries' is set to 0. Use 'timeout' instead.")

        # normalize meta_task_timeout to seconds
        if isinstance(self.meta_task_timeout, datetime.timedelta):
            self.meta_task_timeout = self.meta_task_timeout.total_seconds()
        if not isinstance(self.meta_task_timeout, (int, float)):
            raise ValueError("Cannot convert 'meta_task_timeout' to type(int) or type(float).")

        if self.meta_task_timeout < self.timeout:
            raise ValueError("'meta_task_timeout' cannot be less than the 'timeout' variable.")

        logging.info(f"Get current dagrun params: {context['ti'].task_id}, {context['ti'].dag_id}, {context['ti'].execution_date}, {context['ti'].try_number}")
        pg_hook = PostgresHook(postgres_conn_id="airflow-metadata-db")
        pg_conn = pg_hook.get_conn()
        pg_cur = pg_conn.cursor()
        if context['ti'].try_number != 1:
            # on a retry, look up the start time of the very first try in the metadata DB
            try:
                query = f"""
                select start_date from task_fail
                    where task_id='{context['ti'].task_id}'
                    and dag_id='{context['ti'].dag_id}'
                    and execution_date ='{context['ti'].execution_date}'
                    order by start_date asc
                    LIMIT 1;"""
                pg_cur.execute(query)
                init_start_timestamp = pg_cur.fetchone()[0]
            except Exception as e:
                raise ConnectionError("Connection failed with error: " + str(e))
            finally:
                pg_cur.close()
                pg_conn.close()
        else:
            # first try: the task instance's own start_date is the initial start
            init_start_timestamp = context['ti'].start_date

        logging.info(f"Initial dag startup: {init_start_timestamp}")

        if (timezone.utcnow() - init_start_timestamp).total_seconds() > self.meta_task_timeout:
            if self.soft_fail:
                self._do_skip_downstream_tasks(context)
            raise AirflowSkipException('Snap. Maximal task runtime is UP.')

        logging.info(f"Time left until 'meta_task_timeout' applies: {self.meta_task_timeout - (timezone.utcnow() - init_start_timestamp).total_seconds()} second(s).")
  • Overrode/added to the poke function (a fuller wiring sketch follows after this list):

    def poke(self, context):
        ...
        # check for the meta-timeout on every poke
        self._apply_meta_timeout(context)
  • Added a connection to the Airflow metadata database with the id: airflow-metadata-db

  • Called the sensor operator with the additional params:

    dummy_sensor = FileSensor(
        task_id="file_sensor",
        remote_path=os.path.join(REMOTE_INPUT_PATH, REMOTE_INPUT_FILE),
        do_xcom_push=False,
        timeout=60,
        retries=2,
        mode="reschedule",
        meta_task_timeout=5*60,
        soft_fail=True,
        #context=True,
    )
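
A minimal sketch of how the pieces above could fit together in a sensor subclass (the class name and exact structure are my own, not part of the original solution; _apply_meta_timeout is the function shown above):

    from airflow.contrib.sensors.file_sensor import FileSensor
    from airflow.utils.decorators import apply_defaults


    class MetaTimeoutFileSensor(FileSensor):
        """FileSensor with an overall runtime limit that spans retries."""

        @apply_defaults
        def __init__(self, meta_task_timeout=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.meta_task_timeout = meta_task_timeout

        def poke(self, context):
            # enforce the overall limit on every poke, then run the normal check
            self._apply_meta_timeout(context)
            return super().poke(context)

        # _apply_meta_timeout(self, context) as defined above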

The main reason this workaround is needed is that Airflow seems to overwrite the recorded start_date for each individual try, so the start time of the very first try has to be recovered from the task_fail metadata table.

Please feel free to add any suggestions for improvement. Thanks

Bennimi

This is why we use retries and retry_delay for sensors instead of poke_interval and timeout. Retries achieve exactly what you want to do. In your task definition, use

retries=24,
retry_delay=timedelta(hours=1),

instead of

poke_interval=60*60,
timeout=24*60*60,
retries=4,

where, by the way, you should add mode="reschedule", so that your sensor doesn't take up a worker slot for its whole execution time (here, your task would occupy a whole slot for 24 hours while sleeping most of the time).
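
A sketch of the full sensor definition under this approach (Airflow 1.10.x import path; the short poke_interval and timeout values are illustrative):

    import os
    from datetime import timedelta

    from airflow.contrib.sensors.file_sensor import FileSensor

    file_sensor = FileSensor(
        task_id="file_sensor",
        filepath=os.path.join(BASE_DIR, FILE_NAME),
        fs_conn_id="conn_filesensor",
        poke_interval=30,                # poke quickly within a single try
        timeout=60,                      # let each individual try fail fast ...
        retries=24,                      # ... and spread 24 further attempts ...
        retry_delay=timedelta(hours=1),  # ... one hour apart, for roughly 24 h of coverage
        mode="reschedule",               # free the worker slot between pokes and retries
        dag=dag,
    )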

The start_date of each DAG run shouldn't be overwritten by Airflow and should be available through {{ ds }} (which is the start of the data interval) or {{ data_interval_end }} (see the Airflow documentation).

Amandine