
Issue -

I am running a series of MapReduce jobs wrapped in an Oozie workflow. The input data consists of a bunch of text files, most of which are fairly small (KBs), but every now and then I get files over 1-2 MB which cause my jobs to fail. I am seeing two reasons why the jobs fail: first, in one or two MR jobs the file is parsed into an in-memory graph, and for a bigger file that job runs out of memory; second, the jobs are timing out.

Questions -

1) I believe I can just disable the timeout by setting mapreduce.task.timeout to 0, but I am not able to find any documentation that mentions the risks of doing this.

2) For the OOM error, what are the various configs I can tweak? Any links to potential solutions and their risks would be very helpful.

3) I see a lot of "container preempted by scheduler" messages before I finally get the OOM. Is this a separate issue or related? How do I get around it?

Thanks in advance.

chapstick

1 Answer


About the timeout: no need to set it to "unlimited"; a reasonably large value will do (e.g. in our Prod cluster it is set to 300000 ms, i.e. 5 minutes).
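
For illustration, a minimal sketch of how such a value could be set inside the `<configuration>` block of the workflow's `<map-reduce>` action (the 300000 figure is just the 5-minute example above, not a recommendation for your cluster):

```xml
<!-- Sketch: goes inside <map-reduce> ... <configuration> of the Oozie action -->
<property>
  <!-- Milliseconds a task may run without reporting progress before it is killed.
       300000 ms = 5 minutes; 0 would disable the check entirely. -->
  <name>mapreduce.task.timeout</name>
  <value>300000</value>
</property>
```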

About requesting a non-standard RAM quota in Oozie: the properties you are looking for are probably mapreduce.map.memory.mb for the global YARN container quota, oozie.launcher.mapreduce.map.java.opts to instruct the JVM about that quota (i.e. fail gracefully with an OOM exception instead of crashing the container with no useful error message), and the .reduce. counterparts.
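
As a rough sketch only (the sizes below are placeholders; the heap is kept well below the container quota so the JVM can raise the OOM before YARN kills the container):

```xml
<!-- Sketch: container sizes requested from YARN, in MB -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<!-- JVM heap inside those containers, kept below the container quota.
     With the oozie.launcher. prefix these properties target the Oozie launcher
     job instead of the MR job itself (see the note about the prefix below). -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```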

See also that post for the (very poorly documented) oozie.launcher. prefix, in case you want to set properties for a non-MR Action -- e.g. a Shell action, or a Java program that indirectly spawns a series of Map and Reduce steps.
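
For example, a hedged sketch of a Shell action that raises the launcher's own quota via the oozie.launcher. prefix (the action name, script, and values are placeholders):

```xml
<action name="run-driver">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- oozie.launcher.* applies to the launcher container itself,
           not to the MR jobs the script may spawn -->
      <property>
        <name>oozie.launcher.mapreduce.map.memory.mb</name>
        <value>2048</value>
      </property>
      <property>
        <name>oozie.launcher.mapreduce.map.java.opts</name>
        <value>-Xmx1638m</value>
      </property>
    </configuration>
    <exec>run_driver.sh</exec>
    <file>run_driver.sh#run_driver.sh</file>
  </shell>
  <ok to="end"/>
  <error to="fail"/>
</action>
```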

Samson Scharfrichter
  • Thank you for the answer. I forgot to mention that when I get OOM errors there's also usually "container preempted by scheduler" along with them (I'm updating my original post to include this)... any idea whether that means I should be changing some other values as well, or how to get around it? – chapstick Sep 03 '15 at 20:47
  • Gaaah! Your SysAdmin has activated either **Fair Scheduler** or **Capacity Scheduler**, with **preemption** -- i.e. a "license to kill" any container that is running peacefully in a low-priority queue as soon as a high-priority queue starts howling for blood - er, for capacity. Your best hope is to bribe that SysAdmin so as to get access to a higher-priority queue (cf. property `mapred.job.queue.name`; a minimal sketch of setting it follows these comments) or to change the default settings for your user and/or your default queue. – Samson Scharfrichter Sep 03 '15 at 22:38
  • For an intro about preemption you may look at this post http://jason4zhu.blogspot.fr/2014/11/fair-scheduler-in-yarn-hadoop-2.2.0-experiment-on-preemption.html or that post http://www.4-traders.com/HORTONWORKS-INC-19157091/news/Hortonworks--Better-SLAs-via-Resource-preemption-in-YARNrsquos-CapacityScheduler-20570275/ (courtesy of Google) – Samson Scharfrichter Sep 03 '15 at 22:42
  • I will check it out. Thank you! – chapstick Sep 04 '15 at 12:30
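
For completeness, a minimal sketch of pointing an action at a different queue with the property mentioned in the comments above ("high_priority" is a placeholder queue name; on recent Hadoop 2.x releases the equivalent property is mapreduce.job.queuename):

```xml
<!-- Sketch: inside the action's <configuration> block -->
<property>
  <name>mapred.job.queue.name</name>
  <value>high_priority</value>
</property>
```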