- I have a group of job units (workers) that I want to run as a DAG
- Group1 has 10 workers, each of which extracts multiple tables from a DB. Note that each worker maps to a single DB instance, and each worker needs to successfully process 100 tables in total before it can mark itself as complete
- Group1 has a limitation: no more than 5 tables across all 10 workers may be extracted at a time. For example:
- Worker1 is extracting 2 tables
- Worker2 is extracting 2 tables
- Worker3 is extracting 1 table
- Worker4...Worker10 need to wait until Worker1...Worker3 relinquish the threads
- Worker4...Worker10 can pick up tables as soon as threads from step1 free up
- As each worker completes all 100 of its tables, it proceeds to step2 without waiting for the others. Step2 has no concurrency limits
I should be able to create a single Group1 node that handles the throttling, and also have
- 10 independent worker nodes, so I can restart any one of them if it fails
I have tried explaining this in the following diagram:
- If any of the workers fails, I can restart it without affecting the others. It still draws from the same thread pool in Group1, so the concurrency limits stay enforced
- Group1 would complete once all elements of step1 and step2 are complete
- Step2 doesn't have any concurrency measures
How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and to dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 should be able to use all 5 available threads while everything else proceeds to step2.
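For reference, here is one way I imagine this mapping onto Airflow constructs, as a sketch under assumptions: Airflow 2.3+ with dynamic task mapping, a pool named `group1_pool` created out of band with 5 slots (e.g. `airflow pools set group1_pool 5 "Group1 cap"`), and hypothetical calls into the Spring Boot app. Each table extract is a mapped task that holds one pool slot, so Airflow itself enforces the 5-at-a-time cap and reassigns slots as workers finish; this is a configuration sketch, not runnable without an Airflow deployment.

```python
import pendulum
from airflow.decorators import dag, task
from airflow.utils.task_group import TaskGroup


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def group1_pipeline():
    for worker_id in range(1, 11):
        # One TaskGroup per worker: it can be cleared and rerun on its own
        # without touching the other nine workers.
        with TaskGroup(group_id=f"worker{worker_id}"):

            @task(task_id="extract_table", pool="group1_pool")
            def extract_table(table: str):
                # Hypothetical call into the Spring Boot app to extract one
                # table; the pool slot is held only while this task runs.
                ...

            @task(task_id="step2")
            def step2():
                # No pool: this worker's step2 starts as soon as its own
                # extracts finish, regardless of the other workers.
                ...

            tables = [f"table_{i}" for i in range(100)]  # this worker's tables
            extract_table.expand(table=tables) >> step2()


group1_pipeline()
```

With this shape, the Java application never needs to be told a thread count: the scheduler only ever runs up to 5 `extract_table` task instances at once, and when nine workers are done, the remaining worker's mapped tasks can occupy all 5 slots automatically.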