My use-case:
- We have a long-running Spark job, hereafter called LRJ. This job runs once a week.
- We have multiple small jobs that can arrive at any time. These jobs have higher priority than the long-running job.
To address this, we created two YARN queues for resource management: Q1 for the long-running job and Q2 for the small jobs.
Config:
- Q1: capacity = 50%, can go up to 100% (maximum capacity); capacity on CORE nodes = 50%, maximum = 100%
- Q2: capacity = 50%, can go up to 100% (maximum capacity); capacity on CORE nodes = 50%, maximum = 100%
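Roughly, the setup above corresponds to capacity-scheduler.xml properties along these lines. This is only a sketch: the root.Q1/root.Q2 queue paths and the EMR-style CORE node label are assumptions based on the description above, and the exact property paths may differ in our cluster.

```xml
<!-- Sketch of capacity-scheduler.xml for the Q1/Q2 setup described above.
     Queue paths under root and the CORE node label are assumptions. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>Q1,Q2</value>
  </property>

  <!-- Q1: long-running job (LRJ) -->
  <property>
    <name>yarn.scheduler.capacity.root.Q1.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q1.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.maximum-capacity</name>
    <value>100</value>
  </property>

  <!-- Q2: small, higher-priority jobs (same 50% guaranteed / 100% maximum) -->
  <property>
    <name>yarn.scheduler.capacity.root.Q2.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q2.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q2.accessible-node-labels.CORE.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q2.accessible-node-labels.CORE.maximum-capacity</name>
    <value>100</value>
  </property>
</configuration>
```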
Issue we are facing:
When the LRJ is in progress, it acquires all the cluster resources, so the small jobs have to wait. They only get resources once the cluster scales up and new capacity becomes available. Because the scale-up activity takes time, this creates a significant delay in allocating resources to the small jobs.
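For context, the jobs are submitted to their queues roughly like this (a sketch; class and jar names are placeholders, only the --queue routing matters):

```bash
# Weekly long-running job -> Q1
spark-submit --master yarn --deploy-mode cluster --queue Q1 \
  --class com.example.LongRunningJob lrj.jar

# Ad-hoc small job -> Q2 (this is what ends up waiting while Q1 holds all containers)
spark-submit --master yarn --deploy-mode cluster --queue Q2 \
  --class com.example.SmallJob small-job.jar
```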
Update 1:
We have tried the maximum-capacity config as per the YARN docs, but it is not working, as described in my other question here.