Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
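The stdin/stdout contract above can be sketched with a minimal word-count pair. This is a hypothetical illustration, not part of the tag wiki; the script name and the `map`/`reduce` argument convention are assumptions for the example:

```python
#!/usr/bin/env python3
# Hypothetical sketch of a Hadoop Streaming word-count job.
# The mapper emits "word<TAB>1"; the reducer sums counts, relying on the
# framework sorting map output by key before it reaches the reducer.
import sys
from itertools import groupby

def run_mapper(lines):
    """Emit one 'word<TAB>1' pair per word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(lines):
    """Sum the counts for each word; streaming delivers input sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = run_mapper if role == "map" else run_reducer
    for out in step(sys.stdin):
        print(out)
```

Such a script could then be wired in with something like `-mapper 'wordcount.py map' -reducer 'wordcount.py reduce'` (file name assumed for illustration).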

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
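For reference, a mapper like `max_temperature_map.py` might look roughly like the sketch below. This is an illustration only: the fixed-width offsets for the NCDC weather records (year at columns 15–19, signed temperature in tenths of a degree at columns 87–92, then a quality code) are assumptions about the sample data, not something stated on this page.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a Streaming mapper for the max-temperature job.
# The record layout (year/temperature/quality offsets) is assumed, not given here.
import sys

def map_record(line):
    """Parse one fixed-width record; return 'year<TAB>temp' or None."""
    year = line[15:19]
    temp = int(line[87:92])      # signed temperature, tenths of a degree
    quality = line[92]
    if temp != 9999 and quality in "01459":
        return f"{year}\t{temp}"
    return None                  # missing reading or bad quality code

if __name__ == "__main__":
    for line in sys.stdin:
        out = map_record(line)
        if out is not None:
            print(out)
```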
871 questions
0 votes, 1 answer

Finding mean median using python hadoop streaming

Very dumb question.. I have data as following id1, value 1, 20.2 1,20.4 .... I want to find the mean and median of id1? (Note.. mean, median for each id and not the global mean,median) I am using python hadoop streaming.. mapper.py for line in…
frazman • 32,081
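For questions like this one, a Streaming reducer can compute per-id statistics once the framework has grouped the sorted `id<TAB>value` pairs. A hypothetical sketch (the input format is assumed from the excerpt):

```python
#!/usr/bin/env python3
# Hypothetical per-key mean/median reducer for Hadoop Streaming.
# Assumes sorted 'id<TAB>value' lines on stdin, as Streaming delivers them.
import sys
from itertools import groupby
from statistics import mean, median

def reduce_stats(lines):
    """For sorted 'id<TAB>value' lines, yield 'id<TAB>mean<TAB>median'."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{mean(values)}\t{median(values)}"

if __name__ == "__main__":
    for out in reduce_stats(sys.stdin):
        print(out)
```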
0 votes, 1 answer

What is job.get() and job.getBoolean() in mapreduce

I am working on PDF document clustering over Hadoop, so I am learning MapReduce by reading some examples on the internet. The wordcount examples have the lines job.get("map.input.file") and job.getBoolean(). What is the function of these methods? What exactly…
user2200278 • 95
0 votes, 2 answers

Processing XML With Hadoop Streaming failed

I ran bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -inputreader "StreamXmlRecordReader, begin=,end=" -input /user/root/xmlpytext/metaData.xml -mapper /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -file…
USB • 6,019
0 votes, 1 answer

hadoop streaming -file option to pass multiple files

I need to pass multiple files to the hadoop streaming job. As per the doc, the -file option takes a directory as input as well; however, it does not seem to work. The reducer throws a file-not-found error. The other options are to pass each…
akshit • 11
0 votes, 1 answer

Hadoop Streaming - Module dependency

Is there any standard way in hadoop streaming to handle dependencies, similar to the DistributedCache (in Java MR)? Say, for example, I have a python module to be used in all map tasks. How can I achieve it?
user703555 • 265
0 votes, 1 answer

Hadoop Streaming - Perl module dependency

When using Perl scripts as the mapper & reducer in Hadoop streaming, how can we manage Perl module dependencies? I want to use "Net::RabbitMQ" in my Perl mapper & reducer scripts. Is there any standard way in perl/hadoop streaming to handle dependencies…
user703555 • 265
0 votes, 1 answer

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code: for line in sys.stdin: data =…
Shane • 2,315
0 votes, 1 answer

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to localhost/127.0.0.1:54310 failed on local exception

I am getting an error in starting the data node while initiating the single node cluster set up on my machine ************************************************************/ 2013-02-18 20:21:32,300 INFO…
somnathchakrabarti • 3,026
0 votes, 2 answers

Compiling Apache Hadoop Source in Eclipse

After about 4 tries I've managed to use git to checkout apache's Hadoop source code, issue a mvn eclipse:eclipse command and then import all of the projects into eclipse. So far this has been the most successful I have been. I am ALMOST there. I…
Tastybrownies • 897
0 votes, 1 answer

Hadoop environment variables

I'm trying to debug some issues with a single node Hadoop cluster on my Mac. In all the setup docs it says to add: export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk" to remove this…
ScotterC • 1,054
0 votes, 0 answers

can we append into existing file using sync() or syncFs() methods?

I made my own jar in Hadoop and also my own shell script; the shell script runs my jar. My Hadoop jar deletes the previous directory if it already exists (because Hadoop doesn't support APPEND) and creates a new one, but the problem is that because my service is running…
Sarde • 658
0 votes, 2 answers

Running Streaming job in hadoop using Java Apis

I am new to hadoop and learning about streaming jobs. Can anybody guide me regarding how to run Streaming Jobs through Java code? Thanks in Advance.
Ajn • 573
0 votes, 1 answer

AWS Elastic MapReduce Streaming. Use data from nested folders as input

I have data located in structure s3n://bucket/{date}/{file}.gz with > 100 folders. How to setup streaming job and use all of them as input? Specifying s3n://bucket/ didn't help since nodes are folders.
varela • 1,281
0 votes, 1 answer

Hadoop - basic + streaming guidance required

I have written a few MapReduce programs in Apache Hadoop 0.2.x versions - in simple words, I'm a beginner. I am attempting to process a large (over 10GB) SegY file on a Linux machine using a software called SeismicUnix. The basic commands that I…
Kaliyug Antagonist • 3,512
0 votes, 4 answers

Data Node not started

I configured Hadoop settings on my box and worked with example programs; everything went fine and worked well, and all the daemons were in the running state. The next morning, the data node was not running.