
I have a DAG which

  1. downloads a csv file from cloud storage
  2. uploads the csv file to a 3rd party via https

The Airflow cluster I am executing on uses the CeleryExecutor by default, so I'm worried that at some point, when I scale up the number of workers, these tasks may be executed on different workers: e.g. worker A does the download, then worker B tries to upload but doesn't find the file (because it is on worker A).

Is it possible to somehow guarantee that both the download and upload operators will be executed on the same Airflow worker?

marengaz

3 Answers


Put step 1 (the csv download) and step 2 (the csv upload) into a subdag, and then trigger it via the SubDagOperator with the executor option set to a SequentialExecutor - this will ensure that steps 1 and 2 run on the same worker.

Here is a working DAG file illustrating that concept (with the actual operations stubbed out as DummyOperators), with the download/upload steps in the context of some larger process:

from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.executors.sequential_executor import SequentialExecutor

PARENT_DAG_NAME='subdaggy'
CHILD_DAG_NAME='subby'

def make_sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    # The sub-DAG's id must be '<parent>.<child>' and its start date and
    # schedule must match the parent's so the SubDagOperator picks it up.
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date
        )

    task_download = DummyOperator(
        task_id='task_download_csv',
        dag=dag
        )

    task_upload = DummyOperator(
        task_id='task_upload_csv',
        dag=dag
        )

    task_download >> task_upload

    return dag


main_dag = DAG(
    PARENT_DAG_NAME,
    schedule_interval=None,
    start_date=datetime(2017, 1, 1)
)

main_task_1 = DummyOperator(
    task_id='main_1',
    dag=main_dag
)

# Running the sub-DAG with a SequentialExecutor keeps its tasks on the single
# worker that picked up the SubDagOperator task.
main_task_2 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=make_sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, main_dag.start_date, main_dag.schedule_interval),
    executor=SequentialExecutor(),
    dag=main_dag
)

main_task_3 = DummyOperator(
    task_id='main_3',
    dag=main_dag
)

main_task_1 >> main_task_2 >> main_task_3
gcbenison
  • I notice that SubDagOperator is now deprecated. I'm wondering if there is any way to enforce that two operators are executed on the same worker using a `task_group`? – Victor Mayrink Jul 17 '23 at 00:54

For these kinds of use cases there are two solutions:

  1. Use a network-mounted drive that is shared between the workers, so that both the download and upload tasks have access to the same file system.
  2. Use an Airflow queue that is worker-specific. If only one worker listens to this queue, both tasks are guaranteed to run on the same file system. Note that each worker can listen on multiple queues, so it can serve the "default" queue as well as the custom one intended for these tasks (see the sketch below).
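Here is a minimal sketch of option 2. The queue name (single_worker_queue) and the callables are hypothetical; both operators set queue to that name, and only a worker started with `airflow worker -q single_worker_queue` will pick those tasks up, so both run on the same machine:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator


def download_csv():
    # download the file to local disk (stubbed)
    pass


def upload_csv():
    # upload the previously downloaded file (stubbed)
    pass


dag = DAG('csv_transfer', schedule_interval=None, start_date=datetime(2017, 1, 1))

task_download = PythonOperator(
    task_id='task_download_csv',
    python_callable=download_csv,
    queue='single_worker_queue',  # hypothetical queue served by exactly one worker
    dag=dag,
)

task_upload = PythonOperator(
    task_id='task_upload_csv',
    python_callable=upload_csv,
    queue='single_worker_queue',
    dag=dag,
)

task_download >> task_upload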
cr0atIAN
    Both of these are viable workarounds, but they have their drawbacks. Using shared storage requires first that such shared storage is available to the workers, which may or may not be an important constraint; and even when it is available, local storage might have better performance depending on the application. The second approach - tying both tasks to a particular queue - seems to give up some flexibility in where the tasks can be scheduled, in adding additional workers, etc. – gcbenison Aug 25 '17 at 21:19

You could do both the download and the upload within the same function call, in a single task of one DAG. That avoids the complication of coordinating files across tasks (and workers) entirely.
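For example, a minimal sketch with hypothetical helper functions; because both steps run inside one PythonOperator callable, they necessarily execute in the same task instance on a single worker:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator


def download_and_upload():
    # both steps share local disk because they run in the same process
    local_path = '/tmp/data.csv'
    download_from_cloud_storage(local_path)  # hypothetical helper
    upload_via_https(local_path)             # hypothetical helper


dag = DAG('csv_transfer_single_task', schedule_interval=None, start_date=datetime(2017, 1, 1))

transfer = PythonOperator(
    task_id='download_and_upload_csv',
    python_callable=download_and_upload,
    dag=dag,
)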

Chogg