
I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.

I'm writing an MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.

I'm trying to schedule the job with Apache Oozie, and am running into some issues:

Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, : hadoop streaming failed with error code 1

I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.

In my head, everything points to a difference in environments. However, Oozie is running the job as the same user I'm able to run everything with via the cli, and all of the R environment variables (fetched with Sys.getenv()) are the same, except that Oozie sets some additional classpath entries.
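To make that comparison exhaustive rather than eyeballing Sys.getenv() by hand, one option is to snapshot the full environment from each context and diff the two files. A minimal sketch, assuming you can run an R one-liner from both the Oozie action and the cli (the file names here are illustrative):

```r
# Dump every environment variable visible to the R process,
# one KEY=value per line, sorted so the two snapshots diff cleanly.
dump_env <- function(path) {
  env <- Sys.getenv()                      # named character vector
  lines <- sprintf("%s=%s", names(env), env)
  writeLines(sort(lines), path)
}

# Run once inside the Oozie shell action and once from the cli, e.g.:
#   Rscript -e 'source("envdump.R"); dump_env("env_oozie.txt")'
#   Rscript -e 'source("envdump.R"); dump_env("env_cli.txt")'
# then compare:  diff env_cli.txt env_oozie.txt
```

Anything that shows up only in one snapshot (classpath entries included) is a candidate for the failure.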

I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.

Anybody have any thoughts what might be some helpful next steps in hunting this beast down?

UPDATE:

I overwrote the `system` function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr: `user@host.name /usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...`

When run with Oozie, the command printed in the stderr fails with an exit status of 1. When I run the command as user@host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from the cli.
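For reference, the logging shim looks roughly like this. It's a debugging-only sketch: rebinding `system` in the base package/namespace via `unlockBinding` is unsupported, and the exact patching code here is my reconstruction, not verbatim what I ran:

```r
# Keep a handle on the real implementation before patching.
real_system <- base::system

# Shim: log user, node, and command to stderr, then delegate.
logged_system <- function(command, ...) {
  message(sprintf("%s@%s %s",
                  Sys.info()[["user"]],
                  Sys.info()[["nodename"]],
                  command))
  real_system(command, ...)
}

# Patch both the base package environment and the base namespace so
# calls from scripts and from inside packages (e.g. rmr2) both hit the shim.
for (env in list(baseenv(), asNamespace("base"))) {
  unlockBinding("system", env)
  assign("system", logged_system, envir = env)
  lockBinding("system", env)
}
```

After this runs, every `system()` call in the session (including the hadoop-streaming invocation rmr2 makes) prints its command line to stderr first.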

  • What action in Oozie are you using to submit the RHadoop job? – donut Mar 27 '14 at 03:13
  • It's a shell action: http://pastebin.com/3egAA68G – user3402602 Mar 28 '14 at 17:54
  • I should also add that printing the final system call in the rmr2 package [here](https://github.com/RevolutionAnalytics/rmr2/blob/master/pkg/R/streaming.R#L425-L426) yields essentially identical commands, the only difference being paths to random temp directories. – user3402602 Mar 28 '14 at 18:49
  • A difference in environments is the reason for your issue. Are you scheduling your job on a distributed cluster? If so, the Oozie shell action may be submitted on a random node of the cluster, and the environment change on that node may be affecting your job submission – donut Mar 30 '14 at 16:41
  • Agreed. I am running a distributed cluster. I added some logging to show which user/node was invoking the job, and then log the actual hadoop jar that runs (`/usr/bin/hadoop jar...`). The job fails with oozie, and then when I ssh onto the node with the user that ran the process with oozie, it runs successfully... – user3402602 Mar 31 '14 at 17:32
  • You ran the job on that node as the user that starts Oozie, is that right? – donut Apr 01 '14 at 18:49
  • I edited the initial question to better describe the logging and which processes were being run where. It has to be an environment issue (doesn't it?), but I'm not sure what could be different... – user3402602 Apr 02 '14 at 19:47
  • Can you check whether the two environment variables HADOOP_CMD and HADOOP_STREAMING are available in the Oozie shell action you are running? You are using the shell action, not the ssh action, correct? – donut Apr 03 '14 at 16:35
  • Good call. However, those two environment variables are set in `${R_HOME}/etc/Renviron`, not in a bash profile, so they get loaded anytime R gets loaded, not just on ssh login. – user3402602 Apr 04 '14 at 17:10
  • I faced the same issue once, using a shell action to run an RHadoop job. Can you try setting these two variables at the beginning of the shell script you are using to run the job? – donut Apr 04 '14 at 18:08
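The equivalent of donut's suggestion inside the R script itself, if editing the wrapper shell script is awkward, is to set both variables before rmr2 is attached, so the job no longer depends on `${R_HOME}/etc/Renviron` being read the same way in the Oozie-spawned shell. A sketch; the streaming-jar path below is the one from the logged command above and will vary by install:

```r
# Set the variables rmr2 needs before the package is loaded, rather than
# relying on Renviron or the login shell environment.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop",
           HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar")

library(rmr2)  # load only after both variables are in place
```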
