
We're running Airflow 1.10.12 with the KubernetesExecutor and KubernetesPodOperator. In the past few days we've been seeing tasks get stuck in the queued state for a long time (in practice, they stay stuck until we restart the scheduler), while new tasks of the same DAG are scheduled properly.

The only thing that helps is either clearing the stuck task manually or restarting the scheduler service.

We usually see this happen when we run our E2E tests, which spawn ~20 DAG runs for every one of our 3 DAGs. Due to the limited parallelism, some of those runs end up queued, which is fine by us.

These are our parallelism params in airflow.cfg:

parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16

Two of our DAGs override max_active_runs and set it to 10 (see the sketch below).
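
For reference, this is roughly how those two DAGs set max_active_runs at the DAG level. The dag_id, namespace, and image below are placeholders rather than our real values, so treat it as a minimal sketch of the setup, not the exact code:

# Minimal sketch for Airflow 1.10.x; names, namespace, and image are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="example_e2e_dag",           # placeholder dag_id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,             # runs are triggered by the E2E tests
    max_active_runs=10,                 # DAG-level override of max_active_runs_per_dag
    concurrency=16,                     # matches dag_concurrency from airflow.cfg
) as dag:
    run_step = KubernetesPodOperator(
        task_id="run_step",
        name="run-step",
        namespace="default",            # placeholder namespace
        image="busybox:1.32",           # placeholder image
        cmds=["sh", "-c", "echo hello"],
        get_logs=True,
    )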

Any idea what could be causing it?

Meny Issakov
  • Can you check on which node the task instances run to see if the nodes still exist? – vdolez Jan 28 '21 at 22:38
  • @vdolez We're using spot instances, so nodes may come and go. Usually the tasks get stuck in the queued state, so they never actually start running. – Meny Issakov Jan 31 '21 at 09:53

0 Answers