How can I define the parameters for the Airflow KubernetesPodOperator so that all tasks in a DAG run at the same time?

In the image below, some tasks are grey ("scheduled"). I want all of them to run at the same time (green), and I also want to make it impossible for the same task to run more than once at a time.

So:

task1_today & task1_yesterday: Cannot run together

task1_today, task2_today, ...taskN_today: Should be running ALL together

[image: screenshot showing some tasks in the grey "scheduled" state]

This is how my DAGs are defined:

Arguments

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email_on_failure": True,
    "email": ["intelligence@profinda.com"],
    "retries": 2,
    "retry_delay": timedelta(hours=6),
    "email_on_retry": False,
    "image_pull_policy": "Always",
    "max_active_tasks": len(LIST_OF_TASKS),
}

Kubernetes pod

KubernetesPodOperator(
    namespace="airflow",
    service_account_name="airflow",
    image=DAG_IMAGE,
    image_pull_secrets=[k8s.V1LocalObjectReference("docker-registry")],
    container_resources=compute_resources,
    env_vars={
        "EXECUTION_DATE": "{{ execution_date }}",
    },
    cmds=["python3", "launcher.py", "-n", spider_name, "-r", "43000"],
    is_delete_operator_pod=True,
    in_cluster=True,
    name=f"Crawler-{normalised_name}",
    task_id=f"hydra-crawler-{normalised_name}",
    get_logs=True,
    max_active_tis_per_dag=1,  # Called task_concurrency before Airflow 2.2
)
The Dan

1 Answer

task1_today & task1_yesterday: Cannot run together

This constraint should already be satisfied by setting max_active_tis_per_dag to 1 at the task level, as you have done.

task1_today, task2_today, ...taskN_today: Should be running ALL together

This constraint depends on the parallelism setting and the DAG-level max_active_tasks parameter. Currently 16 tasks are running, which is the default for max_active_tasks. I think the reason is that you passed max_active_tasks in default_args, which are forwarded to the operators, instead of passing it directly to the DAG object.

If you add it to the DAG object like this, it should work:

with DAG(..., max_active_tasks=len(LIST_OF_TASKS), ...) as dag:

Alternatively, you can change the Airflow config-level setting max_active_tasks_per_dag, which applies to all DAGs.
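To see why 16 tasks run today and what changes once max_active_tasks is set on the DAG, here is a rough sketch of how the ceilings combine. It is a simplified model, not Airflow's scheduler code: the helper max_concurrent_tasks and the task count of 20 are made up for illustration, and the defaults are the stock Airflow 2.x values as far as I know (parallelism 32, max_active_tasks_per_dag 16, default pool 128 slots). It ignores other DAGs competing for the same slots and any limits of the Kubernetes cluster itself.

```python
# Simplified model of Airflow's concurrency ceilings (NOT the real scheduler logic).
# Default values are assumed stock Airflow 2.x settings.

def max_concurrent_tasks(num_tasks, max_active_tasks=16, parallelism=32, pool_slots=128):
    """Hypothetical helper: upper bound on tasks of one DAG running at once."""
    # The tightest limit wins.
    return min(num_tasks, max_active_tasks, parallelism, pool_slots)

num_tasks = 20  # hypothetical value of len(LIST_OF_TASKS)

# max_active_tasks placed in default_args never reaches the DAG, so the default applies.
before = max_concurrent_tasks(num_tasks)

# max_active_tasks passed to the DAG object itself.
after = max_concurrent_tasks(num_tasks, max_active_tasks=num_tasks)

print(before, after)
```

Under these assumptions the bound goes from 16 to 20. Note that with more than 32 tasks the parallelism setting becomes the limit, so it may need to be raised as well.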

There is this guide listing different scaling parameters that might be helpful. :)

TJaniF
  • Thank you, I'll give it a try and let you know – The Dan Mar 07 '23 at 17:48
  • When you talk about adding it to the task object, do you mean both variables, max_active_tis_per_dag and max_active_tasks? – The Dan Mar 07 '23 at 17:50
  • max_active_tis_per_dag is a task parameter, yes, the way you are already doing it in the KPO code you shared. max_active_tasks is a dag-level parameter, so you can add it for example after `start_date` in your DAG definition. – TJaniF Mar 07 '23 at 22:20