
I'm a newbie with Airflow and I'm trying to figure out the best approach to dynamically create a set of DAGs using information retrieved from a DB. Currently I've thought of this possible solution:

# file: dags_builder_dag.py in DAG_FOLDER

from datetime import datetime

# api_getDBInfo() and create_dag() are my own helpers (DB query and DAG factory).

# Get info to build the required dags from the DB
dag_info = api_getDBInfo()
# Dynamically create dags based on the info retrieved
for dag in dag_info:
    dag_id = 'hello_world_child_{}'.format(str(dag['id']))
    default_args_child = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}
    # Add the dag to the global scope to let Airflow pick it up.
    globals()[dag_id] = create_dag(dag_id, default_args_child)
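
For context, a minimal sketch of what create_dag could look like (assuming Airflow 2.x imports and a single placeholder task; not necessarily my real implementation):

# Sketch only: a possible create_dag factory.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator


def create_dag(dag_id, default_args):
    """Return a DAG with a single placeholder task."""
    dag = DAG(
        dag_id=dag_id,
        default_args=default_args,
        schedule_interval=None,  # run only when triggered manually
        catchup=False,
    )
    with dag:
        DummyOperator(task_id='hello_world')
    return dag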

However, if I'm not wrong, all the DAG files, including the one that generates all the DAGs in this example (dags_builder_dag.py), are periodically parsed by Airflow, which means that api_getDBInfo() is executed at each parse. If that's right, what would be the best practice to avoid continuously executing api_getDBInfo(), which could be a time-consuming operation for the DB? Ideally, this information should be retrieved only when needed, let's say on a manual trigger.

Other possible workarounds that come to my mind:

  • Use an Airflow Variable as a flag to decide whether it's time to parse dags_builder_dag.py again. This variable could be used in the following way:
# file: dags_builder_dag.py in DAG_FOLDER

from datetime import datetime

from airflow.models import Variable

buildDAGs = Variable.get('buildDAGs', default_var='false')
if buildDAGs == 'true':
    # Get info to build the required dags from the DB
    dag_info = api_getDBInfo()
    # Dynamically create dags based on the info retrieved
    for dag in dag_info:
        dag_id = 'hello_world_child_{}'.format(str(dag['id']))
        default_args_child = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}
        # Add the dag to the global scope to let Airflow pick it up.
        globals()[dag_id] = create_dag(dag_id, default_args_child)
  • Set the min_file_process_interval parameter in airflow.cfg to a higher value in order to avoid continuous parsing (see the snippet below). However, this has the downside of also delaying how quickly new or modified DAGs are picked up.
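
For reference, the relevant setting lives in the [scheduler] section of airflow.cfg; the value below is only an illustration, not a recommendation (on Amazon MWAA it would be set as an Airflow configuration option on the environment instead of editing airflow.cfg directly):

# file: airflow.cfg
[scheduler]
# Minimum number of seconds between two parses of the same DAG file.
# A higher value means top-level code like api_getDBInfo() runs less often,
# but new or modified DAGs are also picked up more slowly.
min_file_process_interval = 300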

Update

Thanks to @NicoE and @floating_hammer I found a solution that is suitable for my use case.

First try: Airflow variable as cache

I could use an Airflow Variable as a cache for the data stored in the DB, to avoid the continuous calls to "api_getDBInfo". This way, however, I have another bottleneck: the Variable size. Airflow Variables are key-value pairs; keys have a maximum length of 256, and the value stored in the metadata DB is constrained by the size of the string supported by that DB (see https://github.com/apache/airflow/blob/master/airflow/models/variable.py and https://github.com/apache/airflow/blob/master/airflow/models/base.py).
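
A minimal sketch of that idea (the variable name db_cache, the hourly schedule, and the Airflow 2.x import paths are just assumptions): a separate DAG refreshes the Variable with the result of api_getDBInfo(), and dags_builder_dag.py only reads the Variable at parse time.

# --- file: sync_db_cache_dag.py (sketch) ---
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def refresh_db_cache():
    # The expensive DB call happens inside a task, not at parse time.
    # api_getDBInfo() is the same helper used above.
    Variable.set('db_cache', json.dumps(api_getDBInfo()))


with DAG(
    dag_id='sync_db_cache_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@hourly',
    catchup=False,
) as sync_dag:
    PythonOperator(task_id='refresh_db_cache', python_callable=refresh_db_cache)

# --- file: dags_builder_dag.py (sketch) ---
# Cheap read of the cached JSON at every parse, no DB access here.
dag_info = json.loads(Variable.get('db_cache', default_var='[]'))
for dag in dag_info:
    dag_id = 'hello_world_child_{}'.format(str(dag['id']))
    default_args_child = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}
    globals()[dag_id] = create_dag(dag_id, default_args_child)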

In my case I'm using Amazon MWAA, and details about the underlying metadata database used by AWS and its structure may be hard to find (actually I haven't tried to investigate much). So I just ran a stress test, forcing a lot of data into the Variable to see what happens. The results are below:

Data amount          Results
~0.5 MB (current)    No problems with write and read operations.
~50 MB (x100)        No problems with write and read operations.
~125 MB (x250)       Write and read operations still work, but the Variables section of the Airflow web console becomes inaccessible (the server returns error 502 "Bad Gateway").
~250 MB (x500)       Writing to the variable fails.

Second try: S3 file as a cache

As the previous test showed, Airflow Variables have a size limit, so I kept the same pattern but replaced the Airflow Variable with an S3 file. This works well for my specific use case, since S3 doesn't have the space constraints that Airflow Variables have.

Just to summarize:

  1. I've created a DAG called "sync_db_cache_dag" which, every hour, updates an S3 file "db_cache.json" with the data retrieved by api_getDBInfo(). Data is stored in JSON format.
  2. The script "dags_builder_dag.py" now retrieves its data from "db_cache.json", so the DB is relieved of the continuous calls to "api_getDBInfo" (a rough sketch of both pieces follows).
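
A rough sketch of the two pieces; the bucket name, the key, and the use of the Amazon provider's S3Hook are assumptions, and the sync DAG scaffolding is the same as in the Variable sketch above, only the read/write target changes.

# --- inside sync_db_cache_dag.py: write the cache to S3 instead of a Variable ---
import json

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def refresh_db_cache():
    # api_getDBInfo() is the same helper as above; it runs once per hour inside a task.
    S3Hook().load_string(
        string_data=json.dumps(api_getDBInfo()),
        key='cache/db_cache.json',
        bucket_name='my-mwaa-bucket',
        replace=True,
    )

# --- inside dags_builder_dag.py: read the cache from S3 at parse time ---
raw = S3Hook().read_key(key='cache/db_cache.json', bucket_name='my-mwaa-bucket')
dag_info = json.loads(raw)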

1 Answer


You could try the following steps.

  • Create a variable which will hold the configuration of the tasks and the number of tasks to create.

Create a DAG which gets triggered at a set frequency. The DAG has two tasks:

  • Task 1 reads the database and populates the variable.
  • Task 2 reads the variable and creates the multiple tasks.
– floating_hammer
  • First of all, thank you for your time :) OK, I understand your idea, but I think it's not possible to create DAGs/tasks at run-time. From what I've understood, you can only do this at parse time. In other words, I think it's not possible to have a DAG that, when triggered, generates other DAGs or other tasks. Feel free to correct me if I'm wrong; as I said, I'm a newbie. – Andrea Del Corto Mar 08 '21 at 15:07
  • Actually, it's possible to dynamically create tasks and also to dynamically create multiple DAGs from one single DAG definition (.py). Regarding task generation, you can check this [answer](https://stackoverflow.com/a/66521146/10569220) with a working example. Hope that helps – NicoE Mar 08 '21 at 18:36
  • Andrea - as NicoE says, it is possible to create tasks and DAGs dynamically. Here is another link for getting an idea of how to do that: https://www.cloudwalker.io/2020/12/21/airflow-dynamic-tasks/ – floating_hammer Mar 09 '21 at 11:40
  • Here is another example :) https://stackoverflow.com/questions/66509378/dynamic-tasks-in-airflow-based-on-an-external-file/66521146#66521146 – floating_hammer Mar 09 '21 at 12:11
  • @NicoE and @floating_hammer, thank you for your help. Well, I think that in my case, using an Airflow Variable as a cache for the data stored in the DB could be a good workaround to avoid the continuous calls to "api_getDBInfo". This way, however, I have another bottleneck: the Variable size. Do you guys know what the maximum size for a Variable is? The [official documentation](https://airflow.apache.org/docs/apache-airflow/stable/howto/variable.html) doesn't seem to be helpful. Is it only me, or are all these workarounds a bit odd? I mean, it seems that Airflow has a bad design for this kind of stuff. – Andrea Del Corto Mar 10 '21 at 10:36
  • Hmm, that is interesting. I've never really stored large amounts of data inside an Airflow variable. Airflow variables are key-value pairs. Keys have a maximum length of 256, and the value stored in the metadata would be constrained by the size of the string supported by the metadata DB. https://github.com/apache/airflow/blob/master/airflow/models/variable.py https://github.com/apache/airflow/blob/master/airflow/models/base.py – floating_hammer Mar 10 '21 at 12:29
  • How about splitting your actual DAG into two different DAGs? The first one could have a `PythonOperator` task to do the reading from the DB (avoiding top-level code) and set the variable you will iterate over later. The second DAG would do the dynamic DAG generation, as it iterates over the previously created variable. – NicoE Mar 10 '21 at 14:16
  • Today I have been trying to do exactly what you suggested @NicoE; nice to hear we came to the same possible solution, maybe this means we are on the right path. Currently, I'm trying to stress this approach by forcing a lot of data into the Variable (up to 100/1000 times more than currently required) to see what happens. I'll let you know about the test results. – Andrea Del Corto Mar 10 '21 at 15:22
  • @floating_hammer I'm using [Amazon MWAA](https://docs.aws.amazon.com/mwaa/latest/userguide/what-is-mwaa.html) and details about the underlying metadata database used by AWS and its structure may be hard to find (actually I haven't tried to investigate much). So for the moment, I'm happy with running a stress test as described in the previous reply to NicoE. – Andrea Del Corto Mar 10 '21 at 15:28