14

I'm testing out Airflow, and after triggering a (seemingly) large number of DAGs at the same time, it just fails to schedule anything and starts killing processes. These are the logs the scheduler prints:

[2019-08-29 11:17:13,542] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:13,544] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:18:15,692] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:15,693] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:46,765] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:18:46,766] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:19:17,845] {scheduler_job.py:214} WARNING - Killing PID 42177
[2019-08-29 11:19:17,846] {scheduler_job.py:214} WARNING - Killing PID 42177
...

I'm using the LocalExecutor with a PostgreSQL backend DB. It seems to happen only after I trigger a large number (>100) of DAGs at about the same time via external triggering, as in:

airflow trigger_dag DAG_NAME

After it finishes killing whatever processes it is killing, it starts executing all of the tasks properly. I don't even know what these processes were, as I can't see them after they are killed...

Has anyone encountered this kind of behavior? Any idea why it would happen?

GuD
  • 668
  • 5
  • 9
  • What's your concurrency setting for the dag? – Chengzhi Aug 29 '19 at 18:41
  • Do you mean the max active runs per DAG? The settings there are quite unclear as to what they affect, and the documentation online is just as unclear. Is there a specific setting I should look at? – GuD Aug 30 '19 at 19:22
  • Maybe it's easier if you can share the DAG file? The default is 16 concurrent tasks, but you can bump it up. https://github.com/apache/airflow/blob/master/airflow/models/dag.py#L134 – Chengzhi Aug 30 '19 at 19:36
  • We seem to be experiencing a similar issue since upgrading to Airflow 1.10.5, but we haven't been able to get to the bottom of it. What version of Airflow are you running? – Louis Simoneau Sep 12 '19 at 05:22
  • @LouisSimoneau what version does not have the issue? – tooptoop4 Sep 14 '19 at 17:00
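
For reference, the per-DAG knobs discussed in the comments above are arguments to the DAG constructor. A minimal, hedged sketch (the dag_id and values are made up, assuming Airflow 1.10): "concurrency" caps running task instances across all runs of a DAG, while "max_active_runs" caps simultaneous DAG runs.

from datetime import datetime

from airflow import DAG

# Illustrative only; the dag_id and values are not from the original post.
dag = DAG(
    dag_id="example_dag",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,   # run only when triggered externally
    concurrency=16,           # max running task instances across all runs of this DAG
    max_active_runs=16,       # max simultaneous runs of this DAG
)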

3 Answers

7

The reason for the above in my case was that I had a DAG file creating a very large number of DAGs dynamically.
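
To make "dynamically" concrete: a generator file of this kind is typically a single Python module that builds many DAG objects in a loop and registers each one at module level so the DagBag picks them all up. A hedged sketch with made-up names and counts:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical generator; the naming scheme and the count are illustrative.
for i in range(200):
    dag_id = "generated_dag_{}".format(i)
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,  # triggered externally with airflow trigger_dag
    )
    DummyOperator(task_id="noop", dag=dag)
    # The scheduler only sees DAGs exposed as module-level variables.
    globals()[dag_id] = dag

Importing one such file can easily take longer than the scheduler is willing to wait, which is where the timeout below comes in.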

The "dagbag_import_timeout" config variable which controls "How long before timing out a python file import while filling the DagBag" was set to the default value of 30. Thus the process filling the DagBag kept timing out.

GuD
  • 668
  • 5
  • 9
  • 1
    This answer just saved me; we don't have control over this variable in AWS MWAA, but it did help me realize that our DAG generator was taking too long, and splitting it up fixed the problem! – Tommy Mar 17 '21 at 16:14
4

I've had a very similar issue. My DAG was of the same nature (a file that generates many DAGs dynamically). I tried the suggested solution, but it didn't work: I already had this value set fairly high (60 seconds), and increasing it to 120 didn't resolve my issue.

Posting what worked for me in case someone else has a similar issue.

I came across this JIRA ticket: https://issues.apache.org/jira/browse/AIRFLOW-5506

which helped me resolve my issue: I disabled the SLA configuration, and then all my tasks started to run!

There may be other solutions as well, as other comments on that ticket suggest.

For the record, my issue started to occur after I enabled a lot of such DAGs (around 60?) that had been disabled for a few months. Not sure how the SLA affects this from a technical perspective, TBH, but it did.
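
For concreteness, SLAs in Airflow are usually enabled through the "sla" key in default_args (or per task), so "disabling the SLA" typically amounts to removing that key. A rough, hypothetical sketch (names and values are illustrative, not my actual DAGs):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "start_date": datetime(2019, 1, 1),
    # Before: "sla": timedelta(hours=1) gave every task an SLA, so the scheduler
    # had to process SLA misses for all of these DAGs.
    # After: no "sla" key at all, i.e. SLAs disabled.
}

dag = DAG(dag_id="example_sla_dag", default_args=default_args, schedule_interval="@daily")
DummyOperator(task_id="noop", dag=dag)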

babis21
  • 1,515
  • 1
  • 16
  • 29
0

I had a similar issue on Airflow 1.10 running on top of Kubernetes.

Restarting all the management nodes and worker nodes solved the issue. They had been running for a year without a reboot. It seems we need periodic maintenance reboots of all Kubernetes nodes to prevent such issues.

tshrinivasan
  • 231
  • 2
  • 7