I have the default pool with 128 slots.

I have now defined additional pools, one per business_unit. A business_unit is a department: the important data (prio 1) uses the default pool, while the prio 2 data uses a dedicated pool for each business_unit.

As I have 4 business_units, I have 5 pools:

1. default         --> 128 slots
2. business_unit_A --> 8 slots
3. business_unit_B --> 8 slots
4. business_unit_C --> 8 slots
5. business_unit_D --> 8 slots

I am unsure how to manage the default pool. Since I created 4 new pools with 8 slots each, am I using 32 of the default pool's slots? Should I redefine the default pool as 96 slots?

Is the total number of available slots 128, so that I have to treat it as 100% of the "available resources"? Or can I add new pools with their own slots and let Airflow manage it behind the scenes? Which approach is recommended?

Does a task use just 1 slot by default? If I increase that for a large task, should its execution be faster? (Does this relate to host resources?)

mrc

1 Answer

Pools are a way to control/limit the resources consumed by your Airflow tasks. There is no limit on the number of pool slots; you can set it to 99999 if you like. You'll have to estimate whether your hardware provides enough resources at peak moments given the number of running tasks.
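As a rough illustration of the point above, here is a toy model of slot accounting for the five pools from the question. It is a sketch and not Airflow's scheduler code: the helper `open_slots` is hypothetical, and the name `default_pool` assumes Airflow's built-in default pool. It shows each pool being counted on its own, so the new pools do not draw from the default pool's 128 slots.

```python
# Toy model of pool accounting; this is not Airflow's scheduler code.
POOLS = {
    "default_pool": 128,
    "business_unit_A": 8,
    "business_unit_B": 8,
    "business_unit_C": 8,
    "business_unit_D": 8,
}


def open_slots(pools, occupied):
    """Return the free slots per pool; each pool is counted independently."""
    return {name: size - occupied.get(name, 0) for name, size in pools.items()}


# Saturating one business-unit pool leaves every other pool untouched.
free = open_slots(POOLS, {"business_unit_A": 8})

# The pools add up; they are not carved out of the default pool.
total_slots = sum(POOLS.values())
```

In this model the combined ceiling is the sum of all pools, so whether the hardware can sustain that many tasks at once is the estimate you still have to make yourself.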

By default, each task consumes one pool slot. There is however an argument pool_slots on the BaseOperator to claim more than one slot:

BashOperator(
    task_id="large_task",
    ...,
    pool_slots=5,
)

Docs: https://airflow.apache.org/docs/apache-airflow/stable/concepts/pools.html#using-multiple-pool-slots
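To make the effect of pool_slots concrete, the sketch below models an 8-slot pool admitting tasks in order. The function `admit` is hypothetical and greatly simplified (the real scheduler also considers priorities and other concurrency limits). It only illustrates that pool_slots is a weight for admission into the pool: a heavier task leaves less room for others, and nothing in this accounting gives the task more CPU or memory or makes it run faster.

```python
def admit(pool_size, requested_slots):
    """Greedily admit tasks in order while the pool has enough free slots.

    Returns the indexes of the admitted tasks and the slots left over.
    """
    free = pool_size
    running = []
    for index, slots in enumerate(requested_slots):
        if slots <= free:
            running.append(index)
            free -= slots
    return running, free


# One task claiming 5 slots, three single-slot tasks, then another 5-slot task.
running, free = admit(8, [5, 1, 1, 1, 5])

# With the default of one slot per task, an 8-slot pool runs 8 tasks at once.
running_default, _ = admit(8, [1] * 10)
```

Here the second 5-slot task has to wait until slots are released, even though the pool only ever holds four running tasks.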

Note: there are more settings in Airflow controlling/limiting the number of parallel tasks, see https://www.astronomer.io/guides/airflow-scaling-workers.

Bas Harenslak
  • Does it make sense to talk about pools for all Executors, or are they like workers, which only make sense with the CeleryExecutor? Also, if I understand correctly, a task that uses 5 pool_slots can claim a bigger share of the host's resources, but since there is no maximum number of slots, you can't estimate what percentage of the resources that is. Am I wrong? – mrc Apr 22 '22 at 16:07
  • I think it makes sense to talk about pools regardless of the Executor. And yes, I think you got it right; however, estimating resource usage is where the Data Engineer's ingenuity comes to shine. I like to size the pools according to the available resources and then assign each task to the pool for the resource it consumes the most (e.g., a PC with 16GB of RAM would have a 32-slot ram_pool, and tasks which consume 1GB would occupy 3 slots; of course, things are generally not this simple). – Guilherme Z. Santos Dec 26 '22 at 18:53
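The sizing idea in the last comment can be sketched as simple arithmetic. The function `slots_for_task` and its `headroom` factor are hypothetical, introduced here only for illustration. With 32 slots over 16GB, strict proportionality gives 2 slots per GB; a figure like the comment's 3 slots for a 1GB task would correspond to rounding up with some safety margin, which is modeled here as an assumed headroom of 1.5.

```python
import math


def slots_for_task(task_gb, total_gb=16, pool_size=32, headroom=1.0):
    """Map a task's estimated memory use to a number of pool slots.

    headroom > 1.0 reserves extra slots as a safety margin (an assumption
    of this sketch, not an Airflow setting).
    """
    slots_per_gb = pool_size / total_gb
    return max(1, math.ceil(task_gb * slots_per_gb * headroom))


proportional = slots_for_task(1)               # strict proportionality
with_margin = slots_for_task(1, headroom=1.5)  # with a 50% safety margin
small_task = slots_for_task(0.25)              # never less than one slot
```

The resulting number is what you would pass as pool_slots on the task, with the task assigned to the pool that tracks that resource.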