My Airflow code has the PythonOperator callable below, in which I build two lists and push them to XCom:

keys = []
values = []

def attribute_count_check(e_run_id,**context):
    
    job_run_id = int(e_run_id)
    da = "select count (distinct row_num) from dds_metadata.dds_temp_att_table where run_id ={}".format(job_run_id)
    cursor.execute(da)
    res = cursor.fetchall()
    view_res = [x for res in res for x in res]
    count_of_sql = view_res[0]
    print(count_of_sql)
    if count_of_sql < 1:
        print("deleting of cluster")
        return 'delete_cluster'    
    else :
        print("triggering attr_check")
        num_attributes_per_task = num_attr #job_config
        diff = math.ceil (count_of_sql / num_attributes_per_task)
        instance = int(diff)
        n = num_attributes_per_task
        global values
        global keys
        for r in range(1, instance+1):
            #a = r
            keys.append(r)
            lower_ranges =(n*(r-1)) +1
            upper_range = (n*(r - 1)) + n
            b =(lower_ranges,upper_range)
            values.append(b)
            task_instance = context['task_instance']
            task_instance.xcom_push(key="di_keys", value=keys)
            task_instance.xcom_push(key="di_values", value=values)

The XComs pushed by the job are shown in a screenshot (not reproduced here).

Now I am trying to fetch the values from XCom to create clusters dynamically with the code below:

with TaskGroup('dataproc_create_cluster',prefix_group_id=False) as dataproc_create_clusters:

    for i in zip('{{ ti.xcom_pull(key="di_keys")}}','{{ ti.xcom_pull(key="di_values")}}'):
        dynmaic_create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster_{}".format(list(eval(str(i)))[0]),
        project_id='{0}'.format(PROJECT),
        cluster_config=CLUSTER_GENERATOR_CONFIG,
        region='{0}'.format(REGION),
        cluster_name="dataproc-cluster-{}-sit".format(str(i[0])),
    )

But I am getting the below error:

Broken DAG: [/opt/airflow/dags/Cluster_config.py] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 547, in __init__
    validate_key(task_id)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/helpers.py", line 56, in validate_key
    "dots and underscores exclusively".format(k=k)
airflow.exceptions.AirflowException: The key (create_cluster_{) has to be made of alphanumeric characters, dashes, dots and underscores exclusively

So I changed the task_id as below:

task_id="create_cluster_"+re.sub(r'\W+', '', str(list(eval(str(i)))[0])),

After which I got the below error:

airflow.exceptions.DuplicateTaskIdFound: Task id 'create_cluster_' has already been added to the DAG

This made me think that the value in XCom is being parsed one character at a time, so I set render_template_as_native_obj=True.

But I am still getting the duplicate task id error.

djgcp

1 Answer

Regarding the jinja2 templating outside of templated fields

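First, Jinja2 templates are only rendered in an operator's templated fields. Airflow works in two separate steps: it parses the DAG file first and executes the tasks later. While the DAG is being parsed, no task has run yet, so there is no TaskInstance and no XCom to pull from. In your loop, '{{ ti.xcom_pull(key="di_keys")}}' is therefore just a Python string, and zip iterates over it one character at a time. The first character is '{', which is where the task id create_cluster_{ comes from; stripping the non-word characters leaves create_cluster_ on every iteration, hence the duplicate task id error. Setting render_template_as_native_obj=True does not help, because it only changes how templated fields are rendered at execution time. In a templated field, by contrast, the value is computed when the task executes, and at that point the TaskInstance and its XCom pull are available.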

For example, a PythonOperator declares the following templated fields:

template_fields: Sequence[str] = ('templates_dict', 'op_args', 'op_kwargs')
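
The parse-time behaviour in the question can be reproduced in plain Python, without Airflow. The sketch below only assumes the two template strings from the question; the variable names (keys_template, raw_ids, clean_ids) are made up for illustration.

```python
import re

# At parse time these are ordinary strings; nothing has rendered them.
keys_template = '{{ ti.xcom_pull(key="di_keys")}}'
values_template = '{{ ti.xcom_pull(key="di_values")}}'

# zip() walks both strings character by character.
pairs = list(zip(keys_template, values_template))

# pair[0] is the same value that list(eval(str(pair)))[0] yields for a
# tuple of single characters, so it stands in for the question's expression.
raw_ids = ["create_cluster_{}".format(pair[0]) for pair in pairs]

# The second attempt from the question: strip every non-word character.
clean_ids = ["create_cluster_" + re.sub(r"\W+", "", pair[0]) for pair in pairs]

print(pairs[0])       # first pair of characters
print(raw_ids[0])     # task id that fails validate_key
print(clean_ids[:2])  # two identical task ids
```

The first two pairs are both ('{', '{'), so the first task id ends in a brace and, once sanitised, the first two task ids are identical. That matches the two errors reported in the question.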

Changing the number of tasks based on the result of a task

Second, you cannot change the number of tasks a DAG contains based on the output of a task, because the set of tasks is fixed when the DAG is parsed. The one exception is dynamic task mapping (mapped tasks), which is available from Airflow 2.3 onward. There is a nice example in the docs, which I copied here:

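from datetime import datetime

from airflow import DAG
from airflow.decorators import task


@task
def make_list():
    # This can also be from an API call, checking a database, -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]


@task
def consumer(arg):
    # Each mapped task instance receives one element of the list.
    print(arg)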


with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
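
Applied to the question, the upstream task would return one entry per cluster instead of pushing separate key and value lists. The helper below is a minimal sketch of that step in plain Python. The function name build_cluster_batches and the dictionary keys are hypothetical; the range arithmetic and the cluster name pattern are taken from the question.

```python
import math


def build_cluster_batches(attribute_count, attributes_per_task):
    """Return one JSON-serialisable dict per cluster to create.

    Hypothetical helper: the name and the dict layout are assumptions,
    the range arithmetic follows the loop in the question.
    """
    if attribute_count < 1:
        return []

    n_batches = math.ceil(attribute_count / attributes_per_task)
    batches = []
    for r in range(1, n_batches + 1):
        lower = attributes_per_task * (r - 1) + 1
        upper = attributes_per_task * r  # same as n*(r-1) + n in the question
        batches.append(
            {
                "cluster_name": "dataproc-cluster-{}-sit".format(r),
                # A list rather than a tuple, so it survives JSON round-trips.
                "attribute_range": [lower, upper],
            }
        )
    return batches


print(build_cluster_batches(25, 10))
```

On Airflow 2.3 or later, a @task returning the cluster names from this list could presumably feed something like DataprocCreateClusterOperator.partial(task_id=..., project_id=..., region=..., cluster_config=...).expand(cluster_name=...). That wiring is untested here and should be checked against the provider version in use. Note that a mapped operator keeps a single task_id, so the per-cluster variation moves into cluster_name.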
Jorrick Sleijster