
I am running the sample script of RHadoop to test out the system, using the following commands.

library(rmr2)
library(rhdfs)
Sys.setenv(HADOOP_HOME="/usr/bin/hadoop")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-mapreduce/hadoop-streaming.jar")
hdfs.init()
ints = to.dfs(1:100)
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))

But it's giving me an error like below.

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1587)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1611)
13/08/21 18:30:25 INFO mapred.JobClient: Job complete: job_201308191923_0307
13/08/21 18:30:25 INFO mapred.JobClient: Counters: 7
13/08/21 18:30:25 INFO mapred.JobClient:   Job Counters
13/08/21 18:30:25 INFO mapred.JobClient:     Failed map tasks=1
13/08/21 18:30:25 INFO mapred.JobClient:     Launched map tasks=8
13/08/21 18:30:25 INFO mapred.JobClient:     Data-local map tasks=8
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=46647
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/08/21 18:30:25 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
  hadoop streaming failed with error code 1

Any lead on what might be wrong here?

Paul Hiemstra
LonelySoul

1 Answer


HADOOP_HOME should be a directory and HADOOP_CMD should be a program, so setting them to the same thing is wrong right there. But HADOOP_CMD should supersede HADOOP_HOME, so that shouldn't be the root cause. The only option left is debugging. If you had read the debugging guide, you would have dug out stderr and would already know a lot more. With the console output alone, there's nothing to work on.
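For reference, a sketch of how the two variables are typically set apart (the HADOOP_HOME path below is an assumption for a CDH-style install; adjust it to your own layout):

```r
# HADOOP_CMD points at the hadoop executable itself (a program)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")

# HADOOP_HOME points at the Hadoop installation directory, not a binary.
# This path is an assumption for a CDH parcel layout; adjust as needed.
Sys.setenv(HADOOP_HOME = "/opt/cloudera/parcels/CDH/lib/hadoop")

# HADOOP_STREAMING points at the streaming jar, as in the question
Sys.setenv(HADOOP_STREAMING = "/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-mapreduce/hadoop-streaming.jar")
```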

piccolbo
  • Thanks for the comment on HADOOP_CMD and HADOOP_HOME. I have gone through some pages, but "stderr" is still far from my reach. Any baby steps, if possible? – LonelySoul Aug 23 '13 at 04:41
  • The simplest way for me is to use the web UI. In the console you'll see a tracking URL; browse there. I can't explain the exact sequence of clicks, but look for failed tasks, click on one in particular, and look for links to the logs on the far right of the screen. This is general Hadoop stuff, so you may find a tour of the web UI among Hadoop intro material, independent of RHadoop. – piccolbo Aug 23 '13 at 16:16
  • Thanks a ton for that direction. I have found that stderr and stdout are both empty. Does that mean I need to change the *.jar files? I got that link from one person. – LonelySoul Aug 23 '13 at 16:30