
We moved away from the Celery Executor in Airflow 1.10.0 because of some execution limitations, and we're now using the KubernetesExecutor.

Right now we're not able to parallelize all the tasks in some DAGs, even when we change the default executor in subdag_operator directly in the code: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/operators/subdag_operator.py#L38

Our expectation was that, with these modifications and the Kubernetes Executor, we could fan out the execution of all tasks at the same time, but we get the same behavior as the SequentialExecutor.

This is the behavior that we have right now:

[Screenshot: the subdag's tasks executing one at a time, sequentially]

We would like to execute all of them at the same time using KubernetesExecutor.
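
For reference, here is a minimal sketch of the DAG shape involved (Airflow 1.10 style; names and task counts are illustrative, not our actual code):

# Minimal sketch: a parent DAG with one SubDagOperator whose tasks are
# independent of each other, so they could in principle all run at once.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {'owner': 'airflow', 'start_date': datetime(2018, 11, 1)}


def make_subdag(parent_dag_id, task_id, args):
    # The subdag's dag_id must be '<parent_dag_id>.<task_id>'.
    subdag = DAG(dag_id='%s.%s' % (parent_dag_id, task_id),
                 default_args=args, schedule_interval=None)
    for i in range(5):
        DummyOperator(task_id='task-%d' % i, dag=subdag)
    return subdag


with DAG(dag_id='parent_dag', default_args=default_args,
         schedule_interval=None) as parent_dag:
    SubDagOperator(
        task_id='section-1',
        subdag=make_subdag('parent_dag', 'section-1', default_args),
    )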

Comments:
  • The k8s executor of Airflow works for me when parallelizing task execution in a DAG. I suggest you retry with the latest Airflow release, since the k8s executor is pretty new. – shawnzhu Dec 01 '19 at 19:56
  • Hello @shawnzhu, they fixed it a few releases ago; this issue is still present in the earlier versions (Nov 2018). But thanks. – Flavio Dec 02 '19 at 09:52
  • Have you changed the subdag class to use the KubernetesExecutor as the default rather than the SequentialExecutor? – bamdan Apr 20 '20 at 10:36

1 Answer

The Kubernetes Executor in Airflow turns each first-level task into its own worker pod, which runs with the Local Executor.

This means the Local Executor is what actually executes your SubDagOperator inside that pod.

To run the tasks under the SubDagOperator in parallel once the worker pod has been spawned, you need to set the parallelism configuration on the worker pod. If you define the worker pod with a YAML template, edit it to something like this:

apiVersion: v1
kind: Pod
metadata:
  name: dummy-name
spec:
  containers:
    - args: []
      command: []
      env:
        ###################################
        # This is the part you need to add
        ###################################
        - name: AIRFLOW__CORE__PARALLELISM
          # Kubernetes env values must be strings, so quote the number.
          value: "10"
        ###################################
        - name: AIRFLOW__CORE__EXECUTOR
          value: LocalExecutor
        # Hard Coded Airflow Envs
        - name: AIRFLOW__CORE__FERNET_KEY
          valueFrom:
            secretKeyRef:
              name: RELEASE-NAME-fernet-key
              key: fernet-key
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              name: RELEASE-NAME-airflow-metadata
              key: connection
        - name: AIRFLOW_CONN_AIRFLOW_DB
          valueFrom:
            secretKeyRef:
              name: RELEASE-NAME-airflow-metadata
              key: connection
      envFrom: []
      image: dummy_image
      imagePullPolicy: IfNotPresent
      name: base
      ports: []
      volumeMounts:
        - mountPath: "/opt/airflow/logs"
          name: airflow-logs
        # A container may mount each path only once, so the dags volume is
        # mounted a single time here. (If your dags come from git-sync, use a
        # readOnly mount with subPath: repo/tests/dags instead.)
        - mountPath: /opt/airflow/dags
          name: airflow-dags
          readOnly: false
  hostNetwork: false
  restartPolicy: Never
  securityContext:
    runAsUser: 50000
  nodeSelector:
    {}
  affinity:
    {}
  tolerations:
    []
  serviceAccountName: 'RELEASE-NAME-worker-serviceaccount'
  volumes:
    - name: airflow-dags  # must match the volumeMounts name above
      persistentVolumeClaim:
        claimName: RELEASE-NAME-dags
    - emptyDir: {}
      name: airflow-logs
    - configMap:
        name: RELEASE-NAME-airflow-config
      name: airflow-config
    - configMap:
        name: RELEASE-NAME-airflow-config
      name: airflow-local-settings

The SubDagOperator will then follow the specified parallelism and run its tasks in parallel.
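
One caveat, echoing bamdan's comment above: in the 1.10 branch the SubDagOperator defaults to the SequentialExecutor (the line linked in the question), so the pod's parallelism setting only takes effect if the operator is handed a parallel executor. A minimal sketch, assuming Airflow 1.10.x (make_subdag, parent_dag, and default_args are placeholders for your own code):

# Give the SubDagOperator a parallel executor explicitly, so that the
# AIRFLOW__CORE__PARALLELISM set on the worker pod actually applies.
from airflow.executors.local_executor import LocalExecutor
from airflow.operators.subdag_operator import SubDagOperator

section = SubDagOperator(
    task_id='section-1',
    subdag=make_subdag('parent_dag', 'section-1', default_args),  # placeholder subdag factory
    executor=LocalExecutor(),  # overrides the SequentialExecutor default
    dag=parent_dag,            # placeholder parent DAG
)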

– Ryan Siu