
I recently upgraded from v1.7.1.2 to v1.9.0 and after the upgrade I noticed that the CPU usage increased significantly. After doing some digging, I tracked it down to these two scheduler config options: min_file_process_interval (defaults to 0) and max_threads (defaults to 2).

As expected, increasing min_file_process_interval avoids the tight loop and drops CPU usage when the scheduler goes idle. But what I don't understand is why min_file_process_interval affects task execution.

If I set min_file_process_interval to 60s, it now waits no less than 60s between executing each task in my DAG, so if my DAG has 4 sequential tasks, this adds 4 minutes to my execution time. For example:

start -> [task1] -> [task2] -> [task3] -> [task4]
        ^          ^          ^          ^
        60s        60s        60s        60s
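For reference, both options live in the [scheduler] section of airflow.cfg. A sketch of what I'm setting (the 60s value is the example from above; the defaults are the v1.9.0 ones mentioned earlier):

```ini
[scheduler]
# How often (in seconds) the scheduler re-processes each DAG file.
# The default of 0 makes it loop as fast as it can, pegging the CPU.
min_file_process_interval = 60

# Number of DAG-processing threads (default: 2).
max_threads = 2
```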

I have Airflow set up in both a test env and a prod env. This is less of an issue in prod (although still concerning), but a big issue in test. After the upgrade the CPU usage is significantly higher, so I either accept the higher CPU usage or raise the config value to bring it down. However, that adds significant time to my test DAGs' execution.

Why does min_file_process_interval affect time between tasks after the DAG has been scheduled? Are there other config options that could solve my issue?

flutikoff
  • Possible duplicate of [Airflow latency between tasks](https://stackoverflow.com/questions/49902599/airflow-latency-between-tasks) – Hito Apr 26 '18 at 08:22

2 Answers


Another option you might want to look into is

SCHEDULER_HEARTBEAT_SEC

This setting is usually also set to a very tight interval but can be loosened up a bit. This setting, in combination with

MAX_THREADS

did the trick for us. The dev machines are still fast enough for re-deployment, but without a hot, glowing CPU, which is good.
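For illustration, in airflow.cfg these map to the following settings (the values shown are guesses to start from, not recommendations; tune them for your machines):

```ini
[scheduler]
# Seconds the scheduler sleeps between scheduling loops; raising this
# trades some scheduling latency for lower idle CPU.
scheduler_heartbeat_sec = 30

# Parallel DAG-processing threads; fewer threads means less CPU
# but slower DAG file processing (default: 2).
max_threads = 2
```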

tobi6
  • I've played with these settings as well but they don't solve my issue. Lowering max_threads affects CPU usage but also affects runtime. It looks like the DAG needs to be 'processed' before the tasks continue, so dropping this to 1 nearly doubles my runtime. I've also tried lower and higher scheduler_heartbeat_sec options, but I haven't seen any change in performance. The CPU usage is coming from the process manager, and that also seems to dictate when tasks are run. – flutikoff Apr 25 '18 at 18:06
  • 2
    I have a similar problem as the OP; those settings didn't help. And all I have is 1 tiny DAG with 1 tiny task in it! The DagBag fill time is 0.004 seconds. But the scheduler still delays 40+ seconds between tasks. In my case I'm running a huge backfill on the task over 1000+ days. Each task takes 3 seconds, then Airflow spins doing nothing for 40 seconds and then it schedules the next day in the backfill. I had to stop using Airflow since long-term backfill on small tasks is broken, essentially. – eraoul Oct 31 '19 at 00:08

The most likely cause is that there are too many Python files in the dags folder, so the Airflow scheduler spends too much time re-scanning and re-instantiating the DAGs.

It is recommended to first reduce the number of DAG files visible to the scheduler and workers. At the same time, set the scheduler_heartbeat_sec and max_threads values as large as possible.
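As a quick sanity check of how much scanning work the scheduler faces, you could count the .py files under the dags folder. This is a minimal stdlib sketch; count_dag_files and the folder path are illustrative helpers, not Airflow APIs:

```python
import os

def count_dag_files(dags_folder):
    """Count the .py files the scheduler would have to parse on each scan."""
    total = 0
    for _root, _dirs, files in os.walk(dags_folder):
        total += sum(1 for name in files if name.endswith(".py"))
    return total

if __name__ == "__main__":
    # Path is an assumption; point this at your own AIRFLOW_HOME/dags.
    print(count_dag_files(os.path.expanduser("~/airflow/dags")))
```

If the count is large, consolidating DAGs into fewer files (or moving unused files out of the folder) directly reduces the per-scan work.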
