We are running Airflow 1.10.1 with Celery and are seeing a large number of open database connections. When DAGs kick in, the UI hangs for a couple of minutes.
Highlights:
- All nodes are bare metal: 40 CPUs @ 2494.015 MHz, 378 GB RAM, 10 Gb NIC
- DB connections are not being reused
- Connections stay open even though only 5 are active at any time
- Workers create hundreds of connections that remain open until the DB clears them (after ~900 s; see the sketch after this list)
- Each worker runs 100 Celery threads
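The 900 s matches sql_alchemy_pool_recycle on the workers (config below). As far as I understand SQLAlchemy's QueuePool, recycling is lazy: a connection is only replaced when it is checked out again past its recycle age, so idle pooled connections just sit open on the server. A minimal sketch of that behaviour (plain SQLAlchemy, hypothetical DSN, not Airflow code):

from sqlalchemy import create_engine

# Hypothetical DSN; pool arguments mirror our [core] settings below.
engine = create_engine(
    "mysql://user:pass@db-host/airflow",
    pool_size=5,       # sql_alchemy_pool_size
    pool_recycle=900,  # sql_alchemy_pool_recycle
)

conn = engine.connect()
conn.execute("SELECT 1")
conn.close()  # back to the pool, but the server-side connection stays open
# The connection is discarded only if it is checked out again more than
# 900 s later; a process that goes idle never closes it.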
mysql> SHOW GLOBAL STATUS LIKE 'Thread%';
+-------------------------+---------+
| Variable_name           | Value   |
+-------------------------+---------+
| Threadpool_idle_threads | 0       |
| Threadpool_threads      | 0       |
| Threads_cached          | 775     |
| Threads_connected       | 5323    |
| Threads_created         | 4846609 |
| Threads_running         | 5       |
+-------------------------+---------+
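The per-node breakdown below was collected with a query along these lines (assuming the standard information_schema.PROCESSLIST table):

SELECT SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
FROM information_schema.PROCESSLIST
GROUP BY client
ORDER BY conns DESC;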
MySQL connections per node:
31 - worker1
215 - worker2
349 - worker53
335 - worker54
347 - worker55
336 - worker56
336 - worker57
354 - worker58
339 - worker59
328 - worker60
333 - worker61
337 - worker62
2 - scheduler
Worker airflow.cfg:
[core]
sql_alchemy_pool_size = 5
sql_alchemy_pool_recycle = 900
sql_alchemy_reconnect_timeout = 300
parallelism = 1200
dag_concurrency = 800
non_pooled_task_slot_count = 1200
max_active_runs_per_dag = 10
dagbag_import_timeout = 30
[celery]
worker_concurrency = 100
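As far as I can tell, with the CeleryExecutor every task instance runs in its own forked airflow run process, and each process builds its own SQLAlchemy engine, so sql_alchemy_pool_size = 5 is a per-process cap, not a per-worker one; at worker_concurrency = 100 that alone would account for the 300+ connections per node. One mitigation I'm looking at is disabling pooling on the workers so connections are closed as soon as they are released (sketch only; I haven't verified that this option is available in 1.10.1):

[core]
# Unverified on 1.10.1: switches the workers to SQLAlchemy's NullPool,
# so each forked task process closes its DB connection when it is done.
sql_alchemy_pool_enabled = False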
Scheduler airflow.cfg:
[core]
sql_alchemy_pool_size = 30
sql_alchemy_pool_recycle = 300
sql_alchemy_reconnect_timeout = 300
parallelism = 1200
dag_concurrency = 800
non_pooled_task_slot_count = 1200
max_active_runs_per_dag = 10
[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
run_duration = 1800
min_file_process_interval = 10
min_file_parsing_loop_time = 1
dag_dir_list_interval = 300
print_stats_interval = 30
scheduler_zombie_task_threshold = 300
max_tis_per_query = 1024
max_threads = 29
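For completeness, the ~900 s after which the DB drops idle connections can be cross-checked against the server-side limits (standard MySQL variables; Threads_connected = 5323 is also worth comparing to max_connections):

mysql> SHOW GLOBAL VARIABLES LIKE 'wait_timeout';
mysql> SHOW GLOBAL VARIABLES LIKE 'max_connections';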
To add, I'm running 1000 simple tasks like sleep or ls; a sketch of the test DAG follows.
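A minimal sketch of that kind of test DAG (all names illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative load-test DAG: many trivial tasks, nothing DB-heavy in the
# tasks themselves.
dag = DAG(
    dag_id="sleep_test",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

for i in range(1000):
    BashOperator(
        task_id="sleep_{}".format(i),
        bash_command="sleep 10",  # or "ls"
        dag=dag,
    )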