My use-case:
- We have a long-running Spark job, hereafter called LRJ. This job runs once a week.
- We have multiple small jobs that can arrive at any time. These jobs have higher priority than the long-running job.
To address this, we created two YARN queues for resource management: Q1 for the long-running job and Q2 for the small jobs.
Config:
- Q1: capacity = 50%, can go up to 100% (maximum capacity); capacity on CORE nodes = 50%, maximum = 100%
- Q2: capacity = 50%, can go up to 100% (maximum capacity); capacity on CORE nodes = 50%, maximum = 100%
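Roughly, the setup above corresponds to capacity-scheduler.xml properties along these lines. This is only a sketch: the root.Q1/root.Q2 queue paths and the EMR-style CORE node label are assumptions based on the description above, and the exact property paths may differ in our cluster.

```xml
<!-- Sketch of capacity-scheduler.xml for the Q1/Q2 setup described above.
     Queue paths under root and the CORE node label are assumptions. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>Q1,Q2</value>
  </property>

  <!-- Q1: long-running job (LRJ) -->
  <property>
    <name>yarn.scheduler.capacity.root.Q1.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q1.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.maximum-capacity</name>
    <value>100</value>
  </property>

  <!-- Q2: small, higher-priority jobs (same 50% guaranteed / 100% maximum) -->
  <property>
    <name>yarn.scheduler.capacity.root.Q2.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q2.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q2.accessible-node-labels.CORE.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.Q2.accessible-node-labels.CORE.maximum-capacity</name>
    <value>100</value>
  </property>
</configuration>
```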
Issue we are facing:
When the LRJ is in progress, it acquires all the cluster resources, so the small jobs have to wait. They only get resources once the cluster scales up and new capacity becomes available. Because the scale-up activity takes time, this creates a significant delay in allocating resources to the small jobs.
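For context, the jobs are submitted to their queues roughly like this (a sketch; class and jar names are placeholders, only the --queue routing matters):

```bash
# Weekly long-running job -> Q1
spark-submit --master yarn --deploy-mode cluster --queue Q1 \
  --class com.example.LongRunningJob lrj.jar

# Ad-hoc small job -> Q2 (this is what ends up waiting while Q1 holds all containers)
spark-submit --master yarn --deploy-mode cluster --queue Q2 \
  --class com.example.SmallJob small-job.jar
```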
Update 1:
We have tried the maximum-capacity config as per the YARN docs, but it is not working, as described in my other question here.