I manage a cluster with several machines that is shared with colleagues; some of them use Spark and some use MapReduce.
Spark users usually open a context and keep it open for days or weeks, while MR jobs start and finish.
The problem is that the MR jobs often get stuck because:
- After X% of the map phase completes, the job starts running reducers (see the sketch after this list).
- Eventually there are a lot of reducers running and only 5-15 maps left waiting to execute.
- At this point there is not enough memory to start a new map, and the reducers cannot progress past 33% because the maps have not finished yet, producing a deadlock.
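If I understand the behavior correctly, the point at which reducers start is controlled by mapreduce.job.reduce.slowstart.completedmaps (by default a small fraction of completed maps). Below is a minimal sketch of raising it per job; the -D flag assumes the job's driver goes through ToolRunner/GenericOptionsParser, and the jar and class names are just placeholders. I am not sure this alone avoids the deadlock once reducers already hold containers.

```
# Sketch: delay reducer launch until ~95% of the maps have completed.
# Assumes the job's main class uses ToolRunner so -D properties are picked up;
# my-job.jar, com.example.MyJob and the paths are placeholders.
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.job.reduce.slowstart.completedmaps=0.95 \
  /input/path /output/path
```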
The only way to solve this problem so far has been to kill one of the Spark contexts and let the maps finish.
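For context, the manual workaround currently looks roughly like this (the application ID below is only a placeholder):

```
# Find the long-running Spark applications and kill one to free containers.
yarn application -list -appTypes SPARK
yarn application -kill application_1234567890123_0042
```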
Is there a way to configure YARN to avoid this problem?
Thanks.