
Is there a way to find the maximum/minimum or even the average duration of all DagRun instances in Airflow? That is, all dag runs from all dags, not just a single dag.

I can't find a way to do this in the UI, or even a page with a programmatic/command-line example.

2 Answers


You can use the Airflow REST API to get all dag runs for each DAG and calculate the statistics yourself.

An example that fetches the successful dag runs of every DAG and computes each run's total time:

import datetime
import requests
from requests.auth import HTTPBasicAuth

airflow_server = "http://localhost:8080/api/v1/"
auth = HTTPBasicAuth("airflow", "airflow")

get_dags_url = f"{airflow_server}dags"
get_dag_params = {
    "limit": 100,
    "only_active": "true"
}

response = requests.get(get_dags_url, params=get_dag_params, auth=auth)
dags = response.json()["dags"]

get_dag_run_params = {
    "limit": 100,
    "state": "success",  # only completed, successful runs
}
for dag in dags:
    dag_id = dag["dag_id"]
    dag_run_url = f"{airflow_server}dags/{dag_id}/dagRuns"
    response = requests.get(dag_run_url, params=get_dag_run_params, auth=auth)
    dag_runs = response.json()["dag_runs"]
    for dag_run in dag_runs:
        # per the comments below, measure from the run's execution_date
        execution_date = datetime.datetime.fromisoformat(dag_run["execution_date"])
        end_date = datetime.datetime.fromisoformat(dag_run["end_date"])
        duration = end_date - execution_date
        duration_in_s = duration.total_seconds()
        print(duration_in_s)
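
To get the minimum/maximum/average the question asks for, collect the per-run values instead of printing them. A minimal sketch, assuming the loop above is changed to append each `duration_in_s` to a `durations` list:

import statistics

# `durations` is assumed to hold the duration_in_s values
# collected across all DAGs in the loop above
if durations:
    print(f"min: {min(durations):.1f}s")
    print(f"max: {max(durations):.1f}s")
    print(f"avg: {statistics.mean(durations):.1f}s")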
– ozs
  • I've been testing with this and it doesn't seem correct (you used start_date for end_date, but I get what you meant). `dag_run.start_date` seems to return the datetime of the first ever dag run, and `dag_run.end_date` the datetime of the latest dagrun, so the calculation is the span between those two. For example, a dag created 5 days ago ends up with a duration of 5 days plus some hours, minutes, and seconds, rather than the actual dagrun duration in seconds. – pragmatic learner Jun 30 '22 at 13:34
  • @pragmaticlearner, you are right, you should use execution_date instead of start_date (I fixed my answer). Thanks – ozs Jun 30 '22 at 13:47

The easiest way will be to query your Airflow metastore. All the scheduling, DAG runs, and task instances are stored there, and Airflow can't operate without it. I do recommend filtering on DAG ID and/or execution date if your use case allows, since it's not obvious what one can do with just these three overarching numbers alone.

select
    min(runtime_seconds) min_runtime,
    max(runtime_seconds) max_runtime,
    avg(runtime_seconds) avg_runtime
from (
    select extract(epoch from (d.end_date - d.start_date)) runtime_seconds
    from public.dag_run d
    where d.execution_date between '2022-01-01' and '2022-06-30'
      and d.state = 'success'
) runs  -- Postgres requires an alias on the subquery
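
If you'd rather run this from the command line than from a SQL client, a few lines of Python will do. A minimal sketch, assuming a Postgres metastore and a hypothetical DSN (adjust the connection details to your environment; requires the psycopg2 package):

import psycopg2

# assumed connection details - replace with your metastore's DSN
conn = psycopg2.connect("dbname=airflow user=airflow password=airflow host=localhost")

sql = """
select
    min(runtime_seconds) min_runtime,
    max(runtime_seconds) max_runtime,
    avg(runtime_seconds) avg_runtime
from (
    select extract(epoch from (d.end_date - d.start_date)) runtime_seconds
    from public.dag_run d
    where d.execution_date between '2022-01-01' and '2022-06-30'
      and d.state = 'success'
) runs
"""

with conn, conn.cursor() as cur:
    cur.execute(sql)
    min_rt, max_rt, avg_rt = cur.fetchone()
    print(f"min: {min_rt}s  max: {max_rt}s  avg: {avg_rt}s")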

You might also consider joining to the task_instance table for task-level data, perhaps using the min start and max end times of the tasks within a DAG run as your timestamps, as sketched below.
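
A sketch of that task-level variant, reusing the connection from the snippet above; the run_id column on task_instance assumes an Airflow 2.2+ schema:

# per-run duration measured from the first task start to the last task end
task_level_sql = """
select
    min(runtime_seconds) min_runtime,
    max(runtime_seconds) max_runtime,
    avg(runtime_seconds) avg_runtime
from (
    select extract(epoch from (max(ti.end_date) - min(ti.start_date))) runtime_seconds
    from public.task_instance ti
    where ti.state = 'success'
    group by ti.dag_id, ti.run_id
) per_run
"""

with conn, conn.cursor() as cur:
    cur.execute(task_level_sql)
    print(cur.fetchone())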