8

I'm following a Pluralsight video tutorial about Amazon EMR. I am stuck and cannot proceed because I am getting this error:

Not a valid JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar

Please note that the tutorial is old and uses an older EMR version. I am using the latest version. Is that a problem?

The steps that I took after entering the credentials in PuTTY are:

1) Hadoop

2) mkdir streamingCode

3) wget -o ./streamingCode/wordSplitter.py s3://elasticmapreduce/samples/wordcount/wordSplitter.py

4) hadoop jar contrib/streaming/hadoop-streaming.jar -files streamingCode/wordSplitter.py -mapper wordSplitter.py input s3://elasticmapreduce/samples/wordcount/input -output streamingCode/wordCountOut -reducer aggregate

I cannot execute step 4 because I am getting the error below:

Not a valid JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar

harshil bhatt

2 Answers

10

The Hadoop streaming jar is still available in the latest release of EMR Hadoop. Starting with EMR release 4.0.0 it can be found at /usr/lib/hadoop-mapreduce/hadoop-streaming.jar.
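To make the path change above concrete, here is a minimal sketch that picks whichever streaming jar location actually exists on the node. The function name `pick_streaming_jar` and the variable `STREAMING_JAR` are hypothetical helpers introduced for illustration; the two candidate paths are the old one from the question and the 4.0.0+ one given above.

```shell
# Print the first argument that is an existing regular file.
# Returns non-zero if none of the candidates exist.
pick_streaming_jar() {
  for candidate in "$@"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Try the pre-4.0.0 location first, then the location used from 4.0.0 on.
STREAMING_JAR=$(pick_streaming_jar \
  /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  /usr/lib/hadoop-mapreduce/hadoop-streaming.jar) \
  || echo "hadoop-streaming.jar not found in either location" >&2

# If a jar was found, it can replace the hard-coded path in step 4, e.g.:
#   hadoop jar "$STREAMING_JAR" -files streamingCode/wordSplitter.py ...
```

If neither path exists, `STREAMING_JAR` is left empty and a message is written to stderr, so the job command should only be run when the variable is non-empty.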

Another good resource for differences between versions can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-release-differences.html.

ChristopherB
  • I am running into the same issue. Your answer helped me with the actual command needed here: "hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files streamingCode/wordSplitter.py -mapper wordSplitter.py -input s3://elasticmapreduce/samples/wordcount/input -output streamingCode/wordCountOut -reducer aggregate" – rhoeting Oct 18 '15 at 16:03
  • Amazon should really update their documentation to reflect this change: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-launch-emr-jobflow-cli.html. This tutorial still uses the old path, and their log statements don't give you the slightest clue as to why things failed. – Kyle Bridenstine May 15 '18 at 18:43
9

For the HADOOP_STREAMING variable, obtaining the path is a bit more complicated because it depends on the HDP version you are using.

Search for the jar's location with the command: find / -name 'hadoop-streaming*.jar'
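As a rough sketch, the search can be wrapped so its result lands directly in the variable. The function name `find_streaming_jar` and its optional search-root argument are assumptions added for illustration; the `find` pattern is the one from the command above.

```shell
# Print the first hadoop-streaming jar found under the given root
# (defaults to / as in the command above). Permission errors from
# directories the current user cannot read are discarded.
find_streaming_jar() {
  search_root="${1:-/}"
  find "$search_root" -name 'hadoop-streaming*.jar' 2>/dev/null | head -n 1
}

# Typical use (searching the whole filesystem can take a while):
#   export HADOOP_STREAMING="$(find_streaming_jar)"
#   echo "$HADOOP_STREAMING"
```

Taking only the first match is a simplification: if several versions of the jar are installed, it may be worth listing all matches and choosing one explicitly.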

Src: http://thecoatlessprofessor.com/programming/installing-r-studio-server-on-hortonworks-virtual-box-image-and-rmr2-a-k-a-rhadoop-r-package/