Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It lets you create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read from standard input and write to standard output.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
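The scripts invoked above can be sketched in Python. This is a simplified, hypothetical version: it assumes the input has already been reduced to one `year<TAB>temperature` record per line (the real NCDC records need more parsing), but it shows the shape a streaming mapper and reducer take.

```python
def map_line(line):
    """Mapper step: emit 'year<TAB>temperature' for one input line."""
    year, temp = line.strip().split("\t")
    return "%s\t%s" % (year, temp)

def reduce_pairs(lines):
    """Reducer step: keep the maximum temperature seen for each year.
    Streaming delivers mapper output sorted by key, but a dict works
    for grouped or unsorted input alike."""
    max_by_year = {}
    for line in lines:
        year, temp = line.strip().split("\t")
        t = int(temp)
        if year not in max_by_year or t > max_by_year[year]:
            max_by_year[year] = t
    return ["%s\t%d" % (y, t) for y, t in sorted(max_by_year.items())]

# In a real job each script ends with a loop over standard input, e.g.:
#     import sys
#     for line in sys.stdin:
#         print(map_line(line))
```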
871 questions
5 votes, 1 answer

How to do Mapper testing using MRUnit Test?

I am new to Hadoop. I want to test my mapper part alone using MRUnit. I have tried a lot, but I don't know how to solve the following error: "The method setMapper(Mapper) in the type MapDriver is not applicable for the arguments…
Karthick • 97 • 1 • 1 • 7
5 votes, 3 answers

Exception while connecting to mongodb in spark

I get "java.lang.IllegalStateException: not ready" in org.bson.BasicBSONDecoder._decode while trying to use MongoDB as an input RDD: Configuration conf = new Configuration(); conf.set("mongo.input.uri",…
dima_mak • 184 • 1 • 1 • 10
5 votes, 2 answers

Streaming frameworks on top of Hadoop that support ORC, parquet file formats

Does Hadoop streaming support the new columnar storage formats like ORC and Parquet, or are there frameworks on top of Hadoop that allow you to read such formats?
viper • 2,220 • 5 • 27 • 33
5 votes, 2 answers

Opening files on HDFS from Hadoop mapreduce job

Usually, I can open a new file with something like this: aDict = {} with open('WordLists/positive_words.txt', 'r') as f: aDict['positive'] = {line.strip() for line in f} with open('WordLists/negative_words.txt', 'r') as f: aDict['negative']…
Andrew Martin • 5,619 • 10 • 54 • 92
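A common answer to the question above: ship the word lists with `-file`/`-files`; Hadoop copies them into each task's working directory, so the mapper opens them by base name rather than a path that only exists locally. A sketch, with the file names and dict layout taken from the question (the `base_dir` parameter is added only to make the function easy to test):

```python
import os

def load_word_lists(base_dir="."):
    """Build the sentiment dict from word-list files shipped with the job.
    Files distributed via -file/-files land in the task's current working
    directory, so relative base names work inside the mapper."""
    aDict = {}
    for label in ("positive", "negative"):
        path = os.path.join(base_dir, "%s_words.txt" % label)
        with open(path) as f:
            aDict[label] = {line.strip() for line in f}
    return aDict
```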
5 votes, 4 answers

How to resolve java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2?

I am trying to execute NLTK in a Hadoop environment. The following is the command I used for execution: bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar -input /user/nltk/input/ -output /user/nltk/output1/ -file…
Praveen Gr • 187 • 1 • 9 • 22
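Exit code 2 from PipeMapRed usually means the child script itself failed to start or crashed: a missing shebang, a script not shipped with `-file`, or an import error swallowed before any output. One defensive pattern (a hypothetical skeleton, not taken from the question) is to route any traceback to stderr so it appears in the task logs instead of an opaque "subprocess failed" message:

```python
import sys

def safe_main(process_line):
    """Run a line processor over stdin; on failure, print the traceback
    to stderr (visible in the task logs) and return a nonzero code."""
    try:
        for line in sys.stdin:
            out = process_line(line)
            if out is not None:
                print(out)
        return 0
    except Exception:
        import traceback
        traceback.print_exc(file=sys.stderr)
        return 1

# A real mapper script would start with '#!/usr/bin/env python', be marked
# executable, and end with: sys.exit(safe_main(my_line_processor))
```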
5 votes, 2 answers

Load snappy-compressed files into Elastic MapReduce

I have a bunch of snappy-compressed server logs in S3, and I need to process them using streaming on Elastic MapReduce. How do I tell Amazon and Hadoop that the logs are already compressed (before they are pulled into HDFS!) so that they can be…
Abe • 22,738 • 26 • 82 • 111
5 votes, 2 answers

Pivot table with Apache Pig

I wonder if it's possible to pivot a table in one pass in Apache Pig. Input: Id Column1 Column2 Column3 1 Row11 Row12 Row13 2 Row21 Row22 Row23 Output: Id Name Value 1 Column1 Row11 1 Column2 Row12 1 …
PokerIncome.com • 1,708 • 2 • 19 • 30
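The unpivot asked for above needs no reduce step at all in a streaming job: the mapper can emit one `(Id, Name, Value)` row per column. A sketch assuming tab-separated input and a known header order (both assumptions; the question does not state the file format):

```python
def unpivot(line, columns=("Column1", "Column2", "Column3")):
    """Turn one wide row 'Id<TAB>v1<TAB>v2<TAB>v3' into narrow
    'Id<TAB>Name<TAB>Value' rows, one per column."""
    fields = line.strip().split("\t")
    row_id, values = fields[0], fields[1:]
    return ["%s\t%s\t%s" % (row_id, name, val)
            for name, val in zip(columns, values)]
```

Used as a map-only job (`-reducer NONE`), each input row expands to three output rows in a single pass.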
4 votes, 1 answer

mongo-hadoop connector: how to query data

I'm using the mongo-hadoop connector in Java (a Spark application). I read from MongoDB by setting this configuration: Configuration mongodbConfig = new Configuration(); mongodbConfig.set("mongo.job.input.format",…
Anil • 618 • 1 • 7 • 13
4 votes, 1 answer

Python Hadoop Streaming Error "ERROR streaming.StreamJob: Job not Successful!" and Stack trace: ExitCodeException exitCode=134

I am trying to run a Python script on a Hadoop cluster using Hadoop Streaming for sentiment analysis. The same script runs properly on my local machine and gives output. To run it on the local machine I use this command: $ cat…
MegaBytes • 6,355 • 2 • 19 • 36
4 votes, 1 answer

passing JSON argument as a string to python hadoop streaming application

I want to pass a JSON string as command line argument to my reducer.py file but I'm unable to do so. Command I execute is: hadoop jar contrib/streaming/hadoop-streaming.jar -file /home/hadoop/mapper.py -mapper 'mapper.py' -file…
shahsank3t • 252 • 1 • 13
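One way around the shell-quoting problems in the question is to not pass the JSON on the command line at all: streaming's `-cmdenv name=value` flag exports a variable into each task's environment, and the reducer reads it back. A sketch (`MY_JSON` is a made-up variable name; invoke with something like `-cmdenv MY_JSON='{"k": "v"}'`):

```python
import json
import os

def read_json_config(var="MY_JSON", default="{}"):
    """Parse a JSON config passed to the task via -cmdenv.
    Falls back to `default` when the variable is unset."""
    return json.loads(os.environ.get(var, default))
```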
4 votes, 1 answer

How do I create a single in-distributed-memory map from several map-only tasks?

I have several heterogeneous inputs that need to be tackled with different mappers to produce a homogeneous map that can be afterwards reduced by multiple instances of a single reducer. Can it be done in a more elegant way than concatenating outputs…
whoever • 575 • 4 • 18
4 votes, 1 answer

how to write a streaming mapreduce job for warc files in python

I am trying to write a mapreduce job for WARC files using the warc library for Python. The following code works for me, but I need it to run as a Hadoop MapReduce job: import warc f = warc.open("test.warc.gz") for record in f: print…
zahid adeel • 123 • 4
4 votes, 1 answer

How can I get the filename from a streaming mapreduce job in R?

I am streaming an R mapreduce job and I need to get the filename. I know that Hadoop sets environment variables for the current job before it starts, and I can access env vars in R with Sys.getenv(). I found: Get input file name in streaming…
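As the question hints, streaming exports the job configuration to each task's environment with dots replaced by underscores, so in R this is `Sys.getenv("mapreduce_map_input_file")` (or `map_input_file` on the old API). The same lookup, sketched in Python for comparison:

```python
import os

def current_input_file():
    """Return the input file for this map task. The property name differs
    between the new and old MapReduce APIs, so try both env vars."""
    return (os.environ.get("mapreduce_map_input_file")
            or os.environ.get("map_input_file", ""))
```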
4 votes, 2 answers

Error in library(functional) : there is no package called ‘functional’ - While running MR using rmr2

I am trying to run a simple MR program using rmr2 on a single-node Hadoop cluster. Here is the environment for the setup: Ubuntu 12.04 (32-bit), R (Ubuntu comes with 2.14.1, so updated to 3.0.2), installed the latest rmr2 and rhdfs from here and the…
Praveen Sripati • 32,799 • 16 • 80 • 117
4 votes, 2 answers

Post hook for Elastic MapReduce

I wonder if there is an example of a post-process hook for EMR (Elastic MapReduce)? What I am trying to achieve is sending an email to a group of people right after Amazon's Hadoop finishes the job.
Roman Kagan • 10,440 • 26 • 86 • 126