
We are using Cloud Composer in GCP (managed Airflow on a Kubernetes cluster) for scheduling our ETL pipelines.

Our DAGs (200-300 of them) are dynamic, meaning all of them are generated by a single generator DAG file. In Airflow 1.x this was an anti-pattern due to the limitations of the scheduler. However, the Airflow 2.x scheduler handles this scenario much better. See point 3 here.
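
For illustration, this is roughly the pattern our generator follows (a minimal sketch; the pipeline names, schedule, and task below are placeholders, not our real configuration):

```python
# dags/etl_generator.py -- simplified sketch of a single-file DAG generator
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

PIPELINES = ["sales", "orders", "customers"]  # in reality 200-300 entries from config

for name in PIPELINES:
    dag_id = f"etl_{name}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2022, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        BashOperator(task_id="run_etl", bash_command=f"echo processing {name}")

    # Every generated DAG must be reachable at module level so the scheduler registers it
    globals()[dag_id] = dag
```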

We have a pretty powerful environment (see the technical details below), yet we are experiencing high latency between task state changes, which is a bad sign for the scheduler. Additionally, lots of tasks are waiting in the queue, which is a bad sign for the workers. These performance problems show up when 50-60 DAGs are triggered and run concurrently, which is not that big a load in my opinion.

Cloud Composer has an autoscaling feature according to the documentation. As I mentioned, tasks are waiting in the queue for a long time, so we would expect that worker resources are insufficient and a scale-up event should take place. However, that is not the case: no scaling events happen during the load.

Composer specific details:

  • Composer version: composer-2.0.8
  • Airflow version: airflow-2.2.3
  • Scheduler resources: 4 vCPUs, 15 GB memory, 10 GB storage
  • Number of schedulers: 3
  • Worker resources: 4 vCPUs, 15 GB memory, 10 GB storage
  • Number of workers: Auto-scaling between 3 and 12 workers

Airflow specific details:

  • scheduler/min_file_process_interval: 300
  • scheduler/parsing_processes: 24
  • scheduler/dag_dir_list_interval: 300
  • core/dagbag_import_timeout: 3000
  • core/min_serialized_dag_update_interval: 30
  • core/parallelism: 120
  • core/enable_xcom_pickling: false
  • core/dag_run_conf_overrides_params: true
  • core/executor: CeleryExecutor

We do not explicitly set a value for worker_concurrency because it is automatically calculated, according to this documentation. Furthermore, we have one pool with 100,000 slots; however, we have noticed that most of the time the number of running slots is 8-10 while the number of queued slots is 65-85.
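
A quick way to double-check those slot numbers is to query the pools directly from the metadata database. A rough sketch, assuming the running_slots/queued_slots helpers on Airflow 2.x's Pool model:

```python
# Ad-hoc check of per-pool slot usage straight from the Airflow metadata database
from airflow.models import Pool
from airflow.utils.session import create_session

with create_session() as session:
    for pool in session.query(Pool).all():
        print(
            pool.pool,
            "total:", pool.slots,
            "running:", pool.running_slots(session=session),
            "queued:", pool.queued_slots(session=session),
        )
```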

We are constantly monitoring our environment, but we have not been able to find anything so far. We do not see any bottleneck related to worker/scheduler/database/webserver resources (CPU, memory, IO, network).

What could be the bottleneck? Any tips and tricks are more than welcome. Thank you!

Robert
  • Did you check this [documentation](https://cloud.google.com/composer/docs/composer-2/troubleshooting-scheduling#monitoring_running_and_queued_tasks) on troubleshooting queued tasks? – Sakshi Gatyan Apr 05 '22 at 12:52
  • @SakshiGatyan Hi, thanks for the documentation, but we have already gone through that, and it has not helped. That's why I have mentioned in this SO post that we have enough resources (after having a look at the Monitoring tab). The configurations mentioned in the linked documentation are already in place in our environment. – Robert Apr 05 '22 at 13:48
  • Since your issue seems internal, it would be best to raise a [support case](https://cloud.google.com/support/docs/procedures) with GCP if you have a support plan, or to create an issue on the [issue tracker](https://developers.google.com/issue-tracker/guides/create-issue-ui). – Sakshi Gatyan Apr 06 '22 at 08:31

2 Answers


I encountered a similar problem 2 weeks ago. The problem was that in Airflow 2.x the DAG generator is still an anti-pattern in some cases (especially if you use SLAs). According to the documentation, "one file can only be parsed by one FileProcessor". So the Airflow scheduler will run only one child process for all your DAGs, and the scheduling pipeline will look like this:

  1. Parse all DAGs (hundreds)
  2. Check all necessary task instances (hundreds or even thousands?) for existence in the database and generate the missing ones (according to the schedule_interval parameter).
  3. Check SLAs for all outstanding tasks (again, hundreds or even thousands?) and send notifications if necessary. Checking SLAs is the hardest and slowest part of this pipeline. And all of this has to be done by one single Python thread. Too much, I think.

In my case, the problem was that one of the generated DAGs was permanently stuck in the past because of one failed task, and the scheduler process was spending all its time checking SLAs. Disabling the SLA for that DAG solved the problem.

You can see if this is your case simply by temporarily disabling the SLA.
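
If the SLA is set through default_args in the generator (as it often is), commenting it out is enough to stop the scheduler's SLA checks for those DAGs. An illustrative snippet only; your setup may differ:

```python
from datetime import timedelta

default_args = {
    "owner": "etl",
    "retries": 1,
    # "sla": timedelta(hours=1),  # temporarily remove/comment out to disable SLA checks
}
```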

The correct way, I think, is the following:

  1. Split DAGs into groups (one file per group) according to some criteria (see the sketch after this list).
  2. Use your own implementation of the SLA check (outside the scheduler process).
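
For point 1, a per-group generator file could look roughly like this (a sketch only; the etl_config module and group names are made up):

```python
# dags/etl_group_a.py -- one generator file per group, so several FileProcessors
# can parse the DAG folder in parallel
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

from etl_config import PIPELINES  # hypothetical shared config, e.g. {"group_a": [...], ...}

GROUP = "group_a"  # each file hard-codes its own group

for name in PIPELINES[GROUP]:
    dag_id = f"etl_{GROUP}_{name}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2022, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        BashOperator(task_id="run_etl", bash_command=f"echo processing {name}")

    globals()[dag_id] = dag
```

With the DAGs spread over several files, each file gets its own FileProcessor, so one slow or stuck group no longer blocks parsing and SLA checks for all the others.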

You should set your environment size to Medium or Large to increase the database throughput when scheduling tasks.

The documentation recommends a Large environment for ~250 DAGs, and this environment size parameter is independent of Scheduler/Worker/Webserver machine sizing.

Andy Caruso