  • I have a group of job units (workers) that I want to run as a DAG
  • Group1 has 10 workers, and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance, and each worker needs to successfully process 100 tables in total before it can mark itself as complete
  • Group1 has a limitation that no more than 5 tables across all those 10 workers should be extracted at a time. For example:
    • Worker1 is extracting 2 tables
    • Worker2 is extracting 2 tables
    • Worker3 is extracting 1 table
    • Worker4...Worker10 need to wait until Worker1...Worker3 relinquish their threads
    • Worker4...Worker10 can pick up tables as soon as the threads from step1 free up
    • As each worker completes all 100 of its tables, it proceeds to step2 without waiting. Step2 has no concurrency limits

I should be able to create a single node, Group1, that takes care of the throttling, and also have

  • 10 independent worker nodes, so I can restart any one of them if it fails

I have tried explaining this in the following diagram (diagram omitted):

  • If any of the workers fails, I can restart it without affecting the other workers. It still uses the same thread pool from Group1, so the concurrency limits are enforced
  • Group1 would complete once all elements of step1 and step2 are complete
  • Step2 doesn't have any concurrency measures

How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can then use all 5 available threads while everything else proceeds to step2.

Rafay
  • Have you tried looking at the connection pools concept of Airflow? Airflow uses connection pools within the hooks to control the number of parallel connections to a source – Sreenath Kamath Jun 04 '20 at 09:42
  • @SreenathKamath Can you point me to the documentation? I am aware of task pools in Airflow but not sure if you are referring to the same thing here. – Rafay Jun 05 '20 at 15:25
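For reference, a minimal sketch of the task pools mentioned in the comments, assuming a pool named table_extract_pool with 5 slots has already been created via Admin -> Pools or the CLI (the pool name, DAG id, table names, and callable below are illustrative, not from the thread):

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("group1_extracts", start_date=datetime(2020, 6, 1), schedule_interval=None)

def extract_one_table(table_name):
    pass  # placeholder for the real single-table extract

for table in ["t1", "t2", "t3"]:  # illustrative table names
    PythonOperator(
        task_id=f"extract_{table}",
        python_callable=extract_one_table,
        op_kwargs={"table_name": table},
        pool="table_extract_pool",  # the pool's 5 slots cap concurrent extracts at 5
        dag=dag,
    )

Because a pool is global to the Airflow instance, every task that sets pool="table_extract_pool" competes for the same 5 slots, regardless of which DAG or worker it belongs to.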

1 Answer


These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:

Implement suboptimally as an Airflow DAG:

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

# The parent DAG that contains both stages (start_date/schedule omitted for brevity)
parent_dag = DAG("parent")

# Table load tasks run inside a sub-DAG whose executor is capped at parallelism 5
load_tables_subdag = DAG("parent.load_tables")

# Each table load must be its own task, or must be split into sets of tables of a
# predetermined size, such that num_tables_per_task * parallelism = 5
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables_subdag)

load_tables = SubDagOperator(
    task_id="load_tables",
    subdag=load_tables_subdag,
    executor=SomeExecutor(parallelism=5),
    dag=parent_dag,
)

# Jobs done afterwards run with higher parallelism
afterwards_subdag = DAG("parent.afterwards")

for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards_subdag)

afterwards = SubDagOperator(
    task_id="afterwards",
    subdag=afterwards_subdag,
    executor=SomeExecutor(parallelism=high_parallelism),
    dag=parent_dag,
)

# Only after _all_ table load tasks are complete do the afterwards jobs start
load_tables >> afterwards

The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5 slots.
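If the actual extraction runs inside the Spring Boot application, one possible choice for SomeOperator is SimpleHttpOperator, with each task asking the service to extract a single table over HTTP. A sketch only; the connection id spring_boot_extractor and the /extract endpoint are assumptions, not part of the answer:

import json

from airflow.operators.http_operator import SimpleHttpOperator

load_table = SimpleHttpOperator(
    task_id=f"load_table_{table}",
    http_conn_id="spring_boot_extractor",  # Airflow connection pointing at the Spring Boot service
    endpoint="/extract",
    method="POST",
    data=json.dumps({"table": table}),
    headers={"Content-Type": "application/json"},
    dag=load_tables_subdag,
)

Because each task covers exactly one table, the parallelism=5 cap on the load_tables sub-DAG is itself the "no more than 5 tables at a time" limit, so the Java side never needs to be told how many tables it may extract concurrently.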

Implement optimally with job queue:

# This is pseudocode, but could be easily adapted to a framework like Celery

# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():

    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [w.is_working_table_load for w in scheduler.active()]

        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()

        if task:
            after = work(task)
            if is_working_table_load:

                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)

Using this approach, the cluster won't be underutilized.
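As a rough starting point for the job-queue route, here is a Celery sketch that puts table loads and follow-up jobs on separate queues and caps the table-load worker at 5. The broker URL, queue names, and the extract_table / run_step2 helpers are assumptions, and it uses a static split of workers per queue rather than the dynamic reallocation shown in the pseudocode above:

# tasks.py -- a sketch only; broker URL, queue names, and helpers are assumptions
from celery import Celery

app = Celery("pipeline", broker="redis://localhost:6379/0")
app.conf.task_routes = {
    "tasks.load_table": {"queue": "table_loads"},
    "tasks.step2_job": {"queue": "afterwards"},
}

def extract_table(table):
    pass  # placeholder for the single-table extract

def run_step2(payload):
    pass  # placeholder for the step-2 work

@app.task
def load_table(table):
    result = extract_table(table)
    # Once the table is loaded, hand the follow-up work to the unconstrained queue
    step2_job.delay(result)

@app.task
def step2_job(payload):
    run_step2(payload)

Run one worker per queue, for example celery -A tasks worker -Q table_loads --concurrency=5 and celery -A tasks worker -Q afterwards --concurrency=15, then enqueue the initial load_table.delay(table) calls from a driver script.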

Dave