Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It lets you create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read from standard input and write to standard output.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
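The scripts invoked above can be sketched in Python. This is a simplified, hypothetical version: it assumes the input has already been reduced to one `year<TAB>temperature` record per line (the real NCDC records need more parsing), but it shows the shape a streaming mapper and reducer take.

```python
def map_line(line):
    """Mapper step: emit 'year<TAB>temperature' for one input line."""
    year, temp = line.strip().split("\t")
    return "%s\t%s" % (year, temp)

def reduce_pairs(lines):
    """Reducer step: keep the maximum temperature seen for each year.
    Streaming delivers mapper output sorted by key, but a dict works
    for grouped or unsorted input alike."""
    max_by_year = {}
    for line in lines:
        year, temp = line.strip().split("\t")
        t = int(temp)
        if year not in max_by_year or t > max_by_year[year]:
            max_by_year[year] = t
    return ["%s\t%d" % (y, t) for y, t in sorted(max_by_year.items())]

# In a real job each script ends with a loop over standard input, e.g.:
#     import sys
#     for line in sys.stdin:
#         print(map_line(line))
```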
871 questions
5 votes, 1 answer

How to do Mapper testing using MRUnit Test?

I am new to Hadoop. I want to test my mapper part alone using MRUnit. I have tried a lot, but I don't know how to solve the following error: "The method setMapper(Mapper) in the type MapDriver is not applicable for the arguments…
Karthick • 97 • 1 • 1 • 7
5 votes, 3 answers

Exception while connecting to mongodb in spark

I get "java.lang.IllegalStateException: not ready" in org.bson.BasicBSONDecoder._decode while trying to use MongoDB as an input RDD: Configuration conf = new Configuration(); conf.set("mongo.input.uri",…
dima_mak • 184 • 1 • 1 • 10
5 votes, 2 answers

Streaming frameworks on top of Hadoop that support ORC, parquet file formats

Does Hadoop streaming support the new columnar storage formats like ORC and Parquet, or are there frameworks on top of Hadoop that allow you to read such formats?
viper • 2,220 • 5 • 27 • 33
5 votes, 2 answers

Opening files on HDFS from Hadoop mapreduce job

Usually, I can open a new file with something like this: aDict = {} with open('WordLists/positive_words.txt', 'r') as f: aDict['positive'] = {line.strip() for line in f} with open('WordLists/negative_words.txt', 'r') as f: aDict['negative']…
Andrew Martin • 5,619 • 10 • 54 • 92
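A common answer to the question above: ship the word lists with `-file`/`-files`; Hadoop copies them into each task's working directory, so the mapper opens them by base name rather than a path that only exists locally. A sketch, with the file names and dict layout taken from the question (the `base_dir` parameter is added only to make the function easy to test):

```python
import os

def load_word_lists(base_dir="."):
    """Build the sentiment dict from word-list files shipped with the job.
    Files distributed via -file/-files land in the task's current working
    directory, so relative base names work inside the mapper."""
    aDict = {}
    for label in ("positive", "negative"):
        path = os.path.join(base_dir, "%s_words.txt" % label)
        with open(path) as f:
            aDict[label] = {line.strip() for line in f}
    return aDict
```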
5 votes, 4 answers

How to resolve java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2?

I am trying to execute NLTK in a Hadoop environment. The following is the command I used for execution: bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar -input /user/nltk/input/ -output /user/nltk/output1/ -file…
Praveen Gr • 187 • 1 • 9 • 22
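Exit code 2 from PipeMapRed usually means the child script itself failed to start or crashed: a missing shebang, a script not shipped with `-file`, or an import error swallowed before any output. One defensive pattern (a hypothetical skeleton, not taken from the question) is to route any traceback to stderr so it appears in the task logs instead of an opaque "subprocess failed" message:

```python
import sys

def safe_main(process_line):
    """Run a line processor over stdin; on failure, print the traceback
    to stderr (visible in the task logs) and return a nonzero code."""
    try:
        for line in sys.stdin:
            out = process_line(line)
            if out is not None:
                print(out)
        return 0
    except Exception:
        import traceback
        traceback.print_exc(file=sys.stderr)
        return 1

# A real mapper script would start with '#!/usr/bin/env python', be marked
# executable, and end with: sys.exit(safe_main(my_line_processor))
```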
5 votes, 2 answers

Load snappy-compressed files into Elastic MapReduce

I have a bunch of snappy-compressed server logs in S3, and I need to process them using streaming on Elastic MapReduce. How do I tell Amazon and Hadoop that the logs are already compressed (before they are pulled into HDFS!) so that they can be…
Abe • 22,738 • 26 • 82 • 111
5 votes, 2 answers

Pivot table with Apache Pig

I wonder if it's possible to pivot a table in one pass in Apache Pig. Input: Id Column1 Column2 Column3 1 Row11 Row12 Row13 2 Row21 Row22 Row23 Output: Id Name Value 1 Column1 Row11 1 Column2 Row12 1 …
PokerIncome.com • 1,708 • 2 • 19 • 30
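The unpivot asked for above needs no reduce step at all in a streaming job: the mapper can emit one `(Id, Name, Value)` row per column. A sketch assuming tab-separated input and a known header order (both assumptions; the question does not state the file format):

```python
def unpivot(line, columns=("Column1", "Column2", "Column3")):
    """Turn one wide row 'Id<TAB>v1<TAB>v2<TAB>v3' into narrow
    'Id<TAB>Name<TAB>Value' rows, one per column."""
    fields = line.strip().split("\t")
    row_id, values = fields[0], fields[1:]
    return ["%s\t%s\t%s" % (row_id, name, val)
            for name, val in zip(columns, values)]
```

Used as a map-only job (`-reducer NONE`), each input row expands to three output rows in a single pass.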
4 votes, 1 answer

mongo-hadoop connector: how to query data

I'm using the mongo-hadoop connector in Java (a Spark application). I read from MongoDB by setting this configuration: Configuration mongodbConfig = new Configuration(); mongodbConfig.set("mongo.job.input.format",…
Anil • 618 • 1 • 7 • 13
4 votes, 1 answer

Python Hadoop Streaming Error "ERROR streaming.StreamJob: Job not Successful!" and Stack trace: ExitCodeException exitCode=134

I am trying to run a Python script on a Hadoop cluster using Hadoop Streaming for sentiment analysis. The same script runs properly on my local machine and gives output. To run it on the local machine I use this command: $ cat…
MegaBytes • 6,355 • 2 • 19 • 36
4 votes, 1 answer

passing JSON argument as a string to python hadoop streaming application

I want to pass a JSON string as command line argument to my reducer.py file but I'm unable to do so. Command I execute is: hadoop jar contrib/streaming/hadoop-streaming.jar -file /home/hadoop/mapper.py -mapper 'mapper.py' -file…
shahsank3t • 252 • 1 • 13
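One way around the shell-quoting problems in the question is to not pass the JSON on the command line at all: streaming's `-cmdenv name=value` flag exports a variable into each task's environment, and the reducer reads it back. A sketch (`MY_JSON` is a made-up variable name; invoke with something like `-cmdenv MY_JSON='{"k": "v"}'`):

```python
import json
import os

def read_json_config(var="MY_JSON", default="{}"):
    """Parse a JSON config passed to the task via -cmdenv.
    Falls back to `default` when the variable is unset."""
    return json.loads(os.environ.get(var, default))
```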
4 votes, 1 answer

How do I create a single in-distributed-memory map from several map-only tasks?

I have several heterogeneous inputs that need to be tackled with different mappers to produce a homogeneous map that can be afterwards reduced by multiple instances of a single reducer. Can it be done in a more elegant way than concatenating outputs…
whoever • 575 • 4 • 18
4 votes, 1 answer

how to write a streaming mapreduce job for warc files in python

I am trying to write a mapreduce job for WARC files using the warc library for Python. The following code works for me, but I need it to run as a Hadoop MapReduce job: import warc f = warc.open("test.warc.gz") for record in f: print…
zahid adeel • 123 • 4
4 votes, 1 answer

How can I get the filename from a streaming mapreduce job in R?

I am streaming an R mapreduce job and I need to get the filename. I know that Hadoop sets environment variables for the current job before it starts, and I can access env vars in R with Sys.getenv(). I found: Get input file name in streaming…
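As the question hints, streaming exports the job configuration to each task's environment with dots replaced by underscores, so in R this is `Sys.getenv("mapreduce_map_input_file")` (or `map_input_file` on the old API). The same lookup, sketched in Python for comparison:

```python
import os

def current_input_file():
    """Return the input file for this map task. The property name differs
    between the new and old MapReduce APIs, so try both env vars."""
    return (os.environ.get("mapreduce_map_input_file")
            or os.environ.get("map_input_file", ""))
```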
4 votes, 2 answers

Error in library(functional) : there is no package called ‘functional’ - While running MR using rmr2

I am trying to run a simple MR program using rmr2 on a single-node Hadoop cluster. Here is the environment for the setup: Ubuntu 12.04 (32-bit), R (Ubuntu comes with 2.14.1, so updated to 3.0.2), installed the latest rmr2 and rhdfs from here and the…
Praveen Sripati • 32,799 • 16 • 80 • 117
4 votes, 2 answers

Post hook for Elastic MapReduce

I wonder if there is an example of a post-process hook for EMR (Elastic MapReduce)? What I am trying to achieve is sending an email to a group of people right after Amazon's Hadoop finishes the job.
Roman Kagan • 10,440 • 26 • 86 • 126