I am trying to create a DAG in Airflow 2+ that will trigger multiple Data Fusion pipelines in parallel using the CloudDataFusionStartPipelineOperator.

However, I want to assign the parameter values (pipeline name, runtime arguments, etc.) for each Data Fusion pipeline dynamically, based on the output of a previous Python task.

The flow I am trying to build looks like this:

start - read_bq - [df_1, ... df_n]

Here, read_bq is a Python task that reads values (pipeline name, runtime arguments, etc.) from a BigQuery table and returns them as a list.

Then, looping over that list, I want to determine how many Data Fusion pipelines to trigger and assign the values returned from BQ to those pipelines, as sketched below.
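
In rough pseudocode, this is what I want (it does not work as written, since the list from BigQuery is only known at run time; the instance name and location are just placeholders):

from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)

pipelines = read_bq()  # e.g. [('df_1', {...}), ('df_2', {...})]
for name, args in pipelines:
    CloudDataFusionStartPipelineOperator(
        task_id=f'start_{name}',
        pipeline_name=name,
        runtime_args=args,
        instance_name='my-instance',  # placeholder
        location='us-central1',       # placeholder
    )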

The problem I am facing: CloudDataFusionStartPipelineOperator has no task_instance option that can be used for an XCom pull, and I cannot run a loop inside the DAG based on an XCom pull either (since XCom only works within a task).

Any technical help or suggestion is appreciated.

Thanks, Santanu


1 Answer

If I understand your goal correctly, the main issue here is to create a DAG that is dynamic based on the output of your BQ query. Airflow has this functionality, called dynamic task mapping, but it is quite limited. There are a few caveats:

  1. Not all parameters are mappable (e.g. for BashOperator you can map bash_command but not task_id).
  2. If you pass multiple parameters to expand, it creates a cross product (a zip-style alternative is sketched after the screenshot):

[screenshot: the Airflow UI showing the cross product of mapped task instances]
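
As an aside: if you need the parameters paired one-to-one instead of crossed, Airflow 2.4+ has expand_kwargs, which takes a list of dicts, one dict per mapped task instance. A minimal self-contained sketch:

from airflow.decorators import task

@task()
def add(x, y):
    return x + y

# expand_kwargs maps each dict to exactly one task instance,
# so the x/y pairs stay zipped instead of crossed
added_vals = add.expand_kwargs([{'x': 2, 'y': 3}, {'x': 3, 'y': 5}])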

I was able to solve the issue, but I am not happy with this solution, as the execution is not visible in the logs section of the Airflow UI:

from airflow.decorators import dag, task
from datetime import datetime
from airflow.operators.bash import BashOperator

@dag(
    schedule=None,
    start_date=datetime(2022, 10, 29, hour=8),
    catchup=False,
    tags=['stack'],
)
def dynamic_dag():
    """
    ### Template dag"""
    @task()
    def add(x, y):
        print(f'adding {x} to {y}')
        return x + y

    @task()
    def query_bq():
        # stand-in for the real BigQuery read; each tuple holds
        # (task name, bash command) for one pipeline
        return [('first', 'echo df1'), ('second', 'echo df2')]

    @task()
    def run_bash(inp):
        first, second = inp
        # the operator is created and executed by hand inside the task,
        # so the scheduler never sees it -- that is why nothing shows up
        # in the UI logs for it
        b = BashOperator(
            task_id=first,
            bash_command=second)
        b.execute(dict())

    # this is to show multiple parameters (produces a cross product)
    added_vals = add.expand(x=[2, 3], y=[3, 5])
    # this works as intended but leaves no logs in the UI
    run_this = run_bash.expand(inp=query_bq())

dynamic_dag()
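
That said, you may not need the workaround at all: classic operators can be mapped directly with partial()/expand_kwargs() (Airflow 2.4+), which keeps every pipeline run as a separate mapped task instance with its own logs in the UI. A hedged sketch, assuming the Google provider is installed; the instance name and location are placeholders:

from airflow.decorators import task
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)

@task()
def read_bq():
    # placeholder for the real BigQuery read; one dict per pipeline,
    # keys must match the operator's constructor arguments
    return [
        {'pipeline_name': 'df_1', 'runtime_args': {'key': 'val1'}},
        {'pipeline_name': 'df_2', 'runtime_args': {'key': 'val2'}},
    ]

start_pipelines = CloudDataFusionStartPipelineOperator.partial(
    task_id='start_df_pipeline',  # shared by all mapped instances
    instance_name='my-instance',  # placeholder
    location='us-central1',       # placeholder
).expand_kwargs(read_bq())

All mapped runs share the task_id and appear as indexed map entries in the UI, which works around the non-mappable task_id limitation from point 1.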

As a last resort, I would try to create two DAGs: a main one and a worker. The main DAG would pass the number of Data Fusion pipelines (df_n) and all the params to the worker, similar to the approach described here.
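
A rough sketch of that main/worker idea with TriggerDagRunOperator (the worker dag_id and the read_params task are hypothetical, and mapping over conf assumes a reasonably recent Airflow version):

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# one worker DagRun per pipeline; conf carries the per-pipeline params
trigger_workers = TriggerDagRunOperator.partial(
    task_id='trigger_worker',
    trigger_dag_id='df_worker',  # hypothetical worker DAG
).expand(conf=read_params())     # read_params: hypothetical task returning a list of dicts

Inside the worker DAG, each run can then read its parameters from dag_run.conf (e.g. context['dag_run'].conf['pipeline_name']).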
