
I am reading a list of elements from an external file and looping over the elements to create a series of tasks.

For example, if there are 2 elements in the file, [A, B], there will be 2 series of tasks:

A1 -> A2 -> ...
B1 -> B2 -> ...

This element-reading logic is not part of any task but lives in the DAG file itself, so the Scheduler calls it many times a day while parsing the DAG file. I want it to run only during DAG runtime.

Is there already a pattern for this kind of use case?
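The pattern looks roughly like this (the file name and the task logic are placeholders):

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def _process(element, step):
    print(f'processing {element}, step {step}')

with DAG('tasks_from_file', schedule_interval='@daily',
         start_date=days_ago(1), catchup=False) as dag:

    # Top-level code: runs every time the Scheduler parses the DAG file
    with open('dags/elements.txt') as f:
        elements = [line.strip() for line in f if line.strip()]

    for element in elements:  # e.g. ['A', 'B']
        first = PythonOperator(
            task_id=f'{element}1',
            python_callable=_process,
            op_kwargs={'element': element, 'step': 1})
        second = PythonOperator(
            task_id=f'{element}2',
            python_callable=_process,
            op_kwargs={'element': element, 'step': 2})
        first >> second  # A1 -> A2, B1 -> B2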


1 Answer


Depending on your requirements: if what you want is to avoid reading a file many times, and you don't mind reading from the metadata database just as often instead, you could change your approach and use Variables as the source of iteration for creating tasks dynamically.

A basic example could be performing the file reading inside a PythonOperator and setting the Variable you will iterate over later on (in the same callable):

sample_file.json:

{
    "cities": [ "London", "Paris", "BA", "NY" ]
}

Task definition:

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup
import json

def _read_file():
    # Read the file once at task runtime and persist the list in the
    # metadata DB, so the DAG file can iterate over it at parse time
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        Variable.set(key='list_of_cities',
                     value=data['cities'], serialize_json=True)
        print('Loading Variable from file...')

def _say_hello(city_name):
    print('hello from ' + city_name)

with DAG('dynamic_tasks_from_var', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    read_file = PythonOperator(
        task_id='read_file',
        python_callable=_read_file
    )

Then you can read from that Variable and create the dynamic tasks. It's important to set a default_var so the first parse, before the Variable exists, doesn't fail. The TaskGroup is optional.

    # Top-level code
    updated_list = Variable.get('list_of_cities',
                                default_var=['default_city'],
                                deserialize_json=True)
    print(f'Updated LIST: {updated_list}')

    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:

        for index, city in enumerate(updated_list):
            say_hello = PythonOperator(
                task_id=f'say_hello_from_{city}',
                python_callable=_say_hello,
                op_kwargs={'city_name': city}
            )

    # DAG-level dependencies
    read_file >> dynamic_tasks_group

In the Scheduler logs, you will only find:

INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']

Dag Graph View:

[image: dag graph view]

With this approach, the only top-level code (and hence the only code the Scheduler executes continuously) is the call to the Variable.get() method. If you need to read many values, remember that it's recommended to store them in one single JSON Variable, to avoid constantly creating connections to the metadata database (example in this article).
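For instance, instead of one Variable per value (one DB query each, at every parse), you could group everything into a single JSON Variable; the key and field names below are just illustrative:

from airflow.models import Variable

# A single Variable.get() fetches all the values in one query
config = Variable.get('dag_config',
                      default_var={'cities': [], 'batch_size': 10},
                      deserialize_json=True)
cities = config['cities']
batch_size = config['batch_size']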

Update:

  • As of 11-2021, this approach is considered a "quick and dirty" kind of solution.
  • Does it work? Yes, totally. Is it production quality code? No.
  • What's wrong with it? The DB is accessed every time the Scheduler parses the file (by default every 30 seconds), which has nothing to do with your DAG execution. Full details in Airflow Best practices, top-level code.
  • How can this be improved? Consider whether any of the recommended ways of doing dynamic DAG generation applies to your needs (one of them is sketched below).
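For example, one of the alternatives described there is to fetch the Variable at runtime through Jinja templating instead of a top-level Variable.get() call, so parsing the file never touches the metadata database. A minimal sketch, assuming an Airflow 2.x DAG context:

from airflow.operators.bash import BashOperator

# The template is rendered at task execution time, not at parse time
print_cities = BashOperator(
    task_id='print_cities',
    bash_command='echo "{{ var.json.list_of_cities }}"',
)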
  • Good answer! IMHO it's better to read from a file multiple times than from a variable. But it's up to Dev and the requirements he has. – Michael Korotkov Mar 07 '21 at 21:17
  • Wow, a detailed answer! I was thinking about this approach. Generally, calling Variables outside of a task is also considered bad, but it could be better than reading a file if the file call is costly. – Dev Mar 08 '21 at 10:15
  • 2
    I found many articles and discussions about dynamic task generation, (most of them from outdated versions), but in general, all of them end up proposing the same two approaches: read from a file or read from a variable, then iterate and create the tasks. I'm not talking about creating DAGs dynamically, I mean creating tasks based on the result of a previous task in the same DAG. I even found an [answer](https://stackoverflow.com/a/53686174/10569220) from one of Airflow's core committers suggesting this kind of approach. Anyway, if someone knows a better way to achieve this, give me a call! – NicoE Mar 08 '21 at 12:38
  • 1
    @NicoE what dissatisfies you in these approaches ? can you think of a better way ? – Mehdi LAMRANI Nov 06 '21 at 17:10
  • 2
    Hey @MehdiLAMRANI, I just updated the answer including mentions to updated best practices section of the Airflow docs. This subject is further explained there and also presents some code alternatives. – NicoE Nov 07 '21 at 14:24