We recently switched our 10-node cluster from MapReduce to Tez and have been experiencing resource management issues ever since. Preemption does not seem to work as expected:
- a very resource-hungry job (job1) arrives and takes all the free resources
- a second job (job2) arrives and waits for resources to be freed by job1
- job2 gets a very small share (about 5%) for a long time; that share grows very slowly and most of the time never reaches the fair share.
I'm assuming the preemption mechanism of the YARN Fair Scheduler is not working as it should, and that resources are only assigned to job2 when some of job1's containers finish.
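To make the expectation concrete, here is a minimal sketch of the split I would expect once preemption kicks in. It assumes both jobs have equal weight and that each could use the whole cluster on its own; the function name `fair_shares` and the percentage units are purely illustrative and are not taken from YARN's code.

```python
def fair_shares(total, demands):
    """Max-min fair split of `total` among apps with equal weights.

    `demands` maps an app name to the amount it could use.
    Illustrative only: this is not YARN's actual implementation.
    """
    shares = {}
    remaining = float(total)
    pending = dict(demands)
    while pending:
        equal_split = remaining / len(pending)
        # Apps asking for no more than an equal split get their full demand.
        small = {app: d for app, d in pending.items() if d <= equal_split}
        if not small:
            # Everyone left wants more than an equal split: divide evenly.
            for app in pending:
                shares[app] = equal_split
            break
        for app, d in small.items():
            shares[app] = float(d)
            remaining -= d
            del pending[app]
    return shares


# Cluster capacity expressed as 100 (%); both jobs could fill the cluster.
expected = fair_shares(100, {"job1": 100, "job2": 100})
observed_job2 = 5  # roughly what job2 actually gets, in %
print(expected)
print("job2 shortfall (%):", expected["job2"] - observed_job2)
```

Under these assumptions job2 should converge to roughly half of the cluster, whereas in practice it stays near 5%.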
I've looked through the Tez documentation, and it seems to me that Tez was developed with the Capacity Scheduler as the de facto scheduler, but I can't find any guidance for the Fair Scheduler.
Some of the configuration values in use that may help:
hive.server2.tez.default.queues=default
hive.server2.tez.initialize.default.sessions=false
hive.server2.tez.session.lifetime=162h
hive.server2.tez.session.lifetime.jitter=3h
hive.server2.tez.sessions.init.threads=16
hive.server2.tez.sessions.per.default.queue=10
hive.tez.auto.reducer.parallelism=false
hive.tez.bucket.pruning=false
hive.tez.bucket.pruning.compat=true
hive.tez.container.max.java.heap.fraction=0.8
hive.tez.container.size=-1
hive.tez.cpu.vcores=-1
hive.tez.dynamic.partition.pruning=true
hive.tez.dynamic.partition.pruning.max.data.size=104857600
hive.tez.dynamic.partition.pruning.max.event.size=1048576
hive.tez.enable.memory.manager=true
hive.tez.exec.inplace.progress=true
hive.tez.exec.print.summary=false
hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
hive.tez.input.generate.consistent.splits=true
hive.tez.log.level=INFO
hive.tez.max.partition.factor=2.0
hive.tez.min.partition.factor=0.25
hive.tez.smb.number.waves=0.5
hive.tez.task.scale.memory.reserve-fraction.min=0.3
hive.tez.task.scale.memory.reserve.fraction=-1.0
hive.tez.task.scale.memory.reserve.fraction.max=0.5
yarn.scheduler.fair.preemption=true
yarn.scheduler.fair.preemption.cluster-utilization-threshold=0.7
yarn.scheduler.maximum-allocation-mb=32768
yarn.scheduler.maximum-allocation-vcores=4
yarn.scheduler.minimum-allocation-mb=2048
yarn.scheduler.minimum-allocation-vcores=1
yarn.resourcemanager.scheduler.address=${yarn.resourcemanager.hostname}:8030
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.resourcemanager.scheduler.client.thread-count=50
yarn.resourcemanager.scheduler.monitor.enable=false
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy