Requirement: run tasks in parallel, dynamically, based on the number of offset values, which are basically dates.

As shown below, the range starts at the current date (offset 0) and goes back 4 days (end_offset_days), so that one task per date can run in parallel.

start_offset_days / end_offset_days can be dynamic; tomorrow end_offset_days could be changed to 6 to cover more past days.

This is what I tried. The date_list below gives me the list of dates to run in parallel. How do I pass it to the next tasks so I can loop over it?

with DAG(
    dag_id=dag_id,
    default_args=default_args,
    schedule_interval="0 * * * *",
    catchup=False,
    dagrun_timeout=timedelta(minutes=180),
    max_active_runs=1,
    params={},
) as dag:
    @task(task_id='datelist')
    def datelist(**kwargs):
        ti = kwargs['ti']
        import datetime
        date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
        return date_list

    for tss in date_list:
        jb = PythonOperator(
                task_id=jb,
                provide_context=True,
                python_callable=main_run,
                op_kwargs={
                    "start_offset_days": 0,
                    "end_offset_days": 4
                }
              )
         jb
    return dag

Below are the XCom values from date_list: [screenshot]

Karthik

1 Answer

Create a job_list and, inside the for loop, call job_list.append(jb). The line before return dag should then simply be job_list. Since no dependencies are set between those jobs, Airflow can run them all in parallel. The last part of your code should look like this:

    job_list = []
    for tss in date_list:
        jb = PythonOperator(
            # task_id must be a unique string; deriving it from the date is an assumption
            task_id=f"jb_{tss}",
            provide_context=True,
            python_callable=main_run,
            op_kwargs={
                "start_offset_days": 0,
                "end_offset_days": 4,
            },
        )
        job_list.append(jb)
    job_list
    return dag

Rather than leaving each jb as a bare expression inside the loop, append it to a collection. No dependencies are set between the operators in that collection, so they can all run in parallel.
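The loop above passes the same op_kwargs to every operator, so none of them knows which date it is responsible for. Here is a minimal sketch, in plain Python so it runs without Airflow, of how a unique task id and per-date arguments might be derived from date_list. The helper name build_task_specs and the "run_date" key are hypothetical, not part of the original code.

```python
from typing import Dict, List


def build_task_specs(date_list: List[str]) -> List[Dict[str, object]]:
    """Return one spec per date: a unique task_id plus the kwargs for that date."""
    specs = []
    for run_date in date_list:
        specs.append({
            # Hypothetical naming scheme; any unique string per task would do.
            "task_id": f"jb_{run_date}",
            # Hypothetical key; main_run would need to accept run_date.
            "op_kwargs": {"run_date": run_date},
        })
    return specs


if __name__ == "__main__":
    for spec in build_task_specs(["2024-01-03", "2024-01-02"]):
        print(spec["task_id"], spec["op_kwargs"])
```

Inside the for loop, each spec could then be used as PythonOperator(task_id=spec["task_id"], python_callable=main_run, op_kwargs=spec["op_kwargs"]), assuming main_run is adapted to take the date.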

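I would also replace the first part of the DAG. I don't think it has to run as a task: the for loop runs when the DAG file is parsed, so date_list needs to be a plain Python list at that point rather than a task's return value. So instead of: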

    @task(task_id='datelist')
    def datelist(**kwargs):
        ti = kwargs['ti']
        import datetime
        date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
        return date_list

I would simply do it like this:

import datetime
date_list = [(datetime.date.today() - datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(0, 4)]
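Since the question asks for start_offset_days and end_offset_days to be adjustable, the same expression could be wrapped in a small function. This is only a sketch: build_date_list and its today parameter are hypothetical names, and it keeps the range(0, 4) behaviour of the original, where the end offset is exclusive.

```python
import datetime
from typing import List, Optional


def build_date_list(
    start_offset_days: int,
    end_offset_days: int,
    today: Optional[datetime.date] = None,
) -> List[str]:
    """Return dates from start_offset_days back to end_offset_days - 1 days ago."""
    if today is None:
        today = datetime.date.today()
    return [
        (today - datetime.timedelta(days=x)).strftime("%Y-%m-%d")
        # As with range(0, 4) in the original, end_offset_days itself is excluded.
        for x in range(start_offset_days, end_offset_days)
    ]


if __name__ == "__main__":
    # Equivalent to the original expression: today plus the three previous days.
    date_list = build_date_list(0, 4)
    print(date_list)
```

Changing the second argument to 6 would give six dates without touching the loop that creates the operators. The optional today argument is there only so the result can be checked against a fixed date.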
Ben