
We are running Airflow 1.10.1 with Celery and are facing a large number of open MySQL connections. When DAGs kick in, the UI hangs for a couple of minutes.

Highlights:

  • All nodes are bare metal: 40 CPUs @ 2494 MHz, 378 GB RAM, 10 Gb NIC
  • DB connections are not being re-used
  • Connections stay open even though only ~5 are active at a time
  • Workers create hundreds of connections that remain open until the DB clears them (900 sec)
  • Each worker runs 100 Celery threads

MySQL> show global status like 'Thread%';

+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| Threadpool_idle_threads  | 0       |
| Threadpool_threads       | 0       |
| Threads_cached           | 775     |
| Threads_connected        | 5323    |
| Threads_created          | 4846609 |
| Threads_running          | 5       |
+--------------------------+---------+

MySQL connections by host (see the query sketch after this list):

31  - worker1
215 - worker2
349 - worker53
335 - worker54
347 - worker55
336 - worker56
336 - worker57
354 - worker58
339 - worker59
328 - worker60
333 - worker61
337 - worker62
2   - scheduler
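
One way to get a per-host breakdown like the above from the MySQL side (a sketch only; it assumes the user can read information_schema, and the host column format may vary by MySQL flavour):

SELECT SUBSTRING_INDEX(host, ':', 1) AS client_host,
       COUNT(*)                      AS connections
  FROM information_schema.processlist
 GROUP BY client_host
 ORDER BY connections DESC;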

Worker airflow.cfg:

[core]
sql_alchemy_pool_size = 5
sql_alchemy_pool_recycle = 900
sql_alchemy_reconnect_timeout = 300
parallelism = 1200
dag_concurrency = 800
non_pooled_task_slot_count = 1200
max_active_runs_per_dag = 10
dagbag_import_timeout = 30
[celery]
worker_concurrency = 100
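
(My back-of-the-envelope reading: if each of the 100 forked Celery task processes on a node ends up with its own SQLAlchemy pool of sql_alchemy_pool_size = 5, that alone would allow up to 100 × 5 = 500 open connections per worker, the same order of magnitude as the 330-350 per worker shown above.)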

Scheduler airflow.cfg:

[core]
sql_alchemy_pool_size = 30
sql_alchemy_pool_recycle = 300
sql_alchemy_reconnect_timeout = 300
parallelism = 1200
dag_concurrency = 800
non_pooled_task_slot_count = 1200
max_active_runs_per_dag = 10
[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
run_duration = 1800
min_file_process_interval = 10
min_file_parsing_loop_time = 1
dag_dir_list_interval = 300
print_stats_interval = 30
scheduler_zombie_task_threshold = 300
max_tis_per_query = 1024
max_threads = 29

To add: I'm running 1000 simple tasks such as `sleep` or `ls`.

  • You said you are using dynamic DAGs. Can you show us a typical "dynamic" DAG? Does the "dynamic" part involve a database call? You say "facing multiple open connections" — to which database? The Airflow metastore DB? You wrote `CPU(s):40`; are you saying that every worker machine has 40 CPUs? – dstandish Aug 29 '19 at 23:45
    you have given your workers a _lot_ of capacity; have you scaled the database commensurately? – dstandish Aug 29 '19 at 23:53
  • Are you sure you have identical `airflow.cfg` across all machines? The [docs](https://airflow.readthedocs.io/en/1.9.0/configuration.html#scaling-out-with-celery) say `..Airflow configuration settings should be homogeneous across the cluster..`. Also I notice that `worker_concurrency = 100` is quite high; quoting [astronomer.io docs](https://www.astronomer.io/guides/airflow-scaling-workers/) `..it determines how many tasks a single worker can process..` – y2k-shubham Aug 30 '19 at 03:40
  • I have done testing on static DAGs and we have changed the logic in the dynamic ones too, but it did not do any better. – Eugene Bacal Aug 30 '19 at 15:10
  • does the "dynamic" part involve a database call? It does , we were able to reduce about 10% by changing the logic, but still I see a lot of them coming. For example, I just launched 1.10.4 and from a host(scheduler, worker(100)) got 400 connections where static DAGs were running – Eugene Bacal Aug 30 '19 at 15:12
  • @y2k-shubham I do not have identical `cfg`, however, it just varies for sql_alchemy and celery threads, which should not be the case. 100 - is should not be that hight for the hardware we are running on and we are planning to triple the numbers if possible. – Eugene Bacal Aug 30 '19 at 15:20

1 Answer


We were able to drop the number of open connections from 700-800 down to 1-10.

Two things you can do (a minimal config sketch follows the list):

  1. Set sql_alchemy_pool_enabled = False to disable the SQLAlchemy connection pool.
  2. Set up a result_backend other than the metadata DB; in our case we used Redis as the Celery result_backend and kept MySQL as the primary DB.
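
A minimal airflow.cfg sketch of both changes (the Redis URL here is a placeholder; adjust host/port/db for your environment):

[core]
# Disable SQLAlchemy connection pooling so task processes do not
# hold pooled connections open until sql_alchemy_pool_recycle expires.
sql_alchemy_pool_enabled = False

[celery]
# Keep Celery task results out of the metadata DB; point the result
# backend at Redis instead (placeholder URL).
result_backend = redis://your-redis-host:6379/0

With this in place the workers only open a metadata-DB connection while a task actually needs one, which matches the drop we saw.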