Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
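The stdin/stdout contract above can be sketched with a minimal word-count pair. This is a hypothetical illustration, not part of the tag wiki; the script name and the `map`/`reduce` argument convention are assumptions for the example:

```python
#!/usr/bin/env python3
# Hypothetical sketch of a Hadoop Streaming word-count job.
# The mapper emits "word<TAB>1"; the reducer sums counts, relying on the
# framework sorting map output by key before it reaches the reducer.
import sys
from itertools import groupby

def run_mapper(lines):
    """Emit one 'word<TAB>1' pair per word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(lines):
    """Sum the counts for each word; streaming delivers input sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = run_mapper if role == "map" else run_reducer
    for out in step(sys.stdin):
        print(out)
```

Such a script could then be wired in with something like `-mapper 'wordcount.py map' -reducer 'wordcount.py reduce'` (file name assumed for illustration).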

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
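For reference, a mapper like `max_temperature_map.py` might look roughly like the sketch below. This is an illustration only: the fixed-width offsets for the NCDC weather records (year at columns 15–19, signed temperature in tenths of a degree at columns 87–92, then a quality code) are assumptions about the sample data, not something stated on this page.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a Streaming mapper for the max-temperature job.
# The record layout (year/temperature/quality offsets) is assumed, not given here.
import sys

def map_record(line):
    """Parse one fixed-width record; return 'year<TAB>temp' or None."""
    year = line[15:19]
    temp = int(line[87:92])      # signed temperature, tenths of a degree
    quality = line[92]
    if temp != 9999 and quality in "01459":
        return f"{year}\t{temp}"
    return None                  # missing reading or bad quality code

if __name__ == "__main__":
    for line in sys.stdin:
        out = map_record(line)
        if out is not None:
            print(out)
```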
871 questions
0 votes, 1 answer

Finding mean median using python hadoop streaming

Very dumb question.. I have data as following id1, value 1, 20.2 1,20.4 .... I want to find the mean and median of id1? (Note.. mean, median for each id and not the global mean,median) I am using python hadoop streaming.. mapper.py for line in…
frazman • 32,081
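For questions like this one, a Streaming reducer can compute per-id statistics once the framework has grouped the sorted `id<TAB>value` pairs. A hypothetical sketch (the input format is assumed from the excerpt):

```python
#!/usr/bin/env python3
# Hypothetical per-key mean/median reducer for Hadoop Streaming.
# Assumes sorted 'id<TAB>value' lines on stdin, as Streaming delivers them.
import sys
from itertools import groupby
from statistics import mean, median

def reduce_stats(lines):
    """For sorted 'id<TAB>value' lines, yield 'id<TAB>mean<TAB>median'."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{mean(values)}\t{median(values)}"

if __name__ == "__main__":
    for out in reduce_stats(sys.stdin):
        print(out)
```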
0 votes, 1 answer

What is job.get() and job.getBoolean() in mapreduce

I am working on PDF document clustering over Hadoop, so I am learning MapReduce by reading some examples on the internet. The wordcount examples have the lines job.get("map.input.file") and job.getBoolean(). What is the function of these methods? What exactly…
user2200278 • 95
0 votes, 2 answers

Processing XML With Hadoop Streaming failed

I ran bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -inputreader "StreamXmlRecordReader, begin=,end=" -input /user/root/xmlpytext/metaData.xml -mapper /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -file…
USB • 6,019
0 votes, 1 answer

hadoop streaming -file option to pass multiple files

I need to pass multiple files to the hadoop streaming job. As per the doc, the -file option takes a directory as input as well; however, it does not seem to work. The reducer throws a file-not-found error. The other options are to pass each…
akshit • 11
0 votes, 1 answer

Hadoop Streaming - Module dependency

Is there any standard way in hadoop streaming to handle dependencies, similar to the DistributedCache (in Java MR)? Say, for example, I have a python module to be used in all map tasks. How can I achieve it?
user703555 • 265
0 votes, 1 answer

Hadoop Streaming - Perl module dependency

When using Perl scripts as the mapper & reducer in Hadoop streaming, how can we manage Perl module dependencies? I want to use "Net::RabbitMQ" in my Perl mapper & reducer scripts. Is there any standard way in perl/hadoop streaming to handle dependencies…
user703555 • 265
0 votes, 1 answer

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code: for line in sys.stdin: data =…
Shane • 2,315
0 votes, 1 answer

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to localhost/127.0.0.1:54310 failed on local exception

I am getting an error in starting the data node while initiating the single node cluster set up on my machine ************************************************************/ 2013-02-18 20:21:32,300 INFO…
somnathchakrabarti • 3,026
0 votes, 2 answers

Compiling Apache Hadoop Source in Eclipse

After about 4 tries I've managed to use git to checkout apache's Hadoop source code, issue a mvn eclipse:eclipse command and then import all of the projects into eclipse. So far this has been the most successful I have been. I am ALMOST there. I…
Tastybrownies • 897
0 votes, 1 answer

Hadoop environment variables

I'm trying to debug some issues with a single node Hadoop cluster on my Mac. In all the setup docs it says to add: export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk" to remove this…
ScotterC • 1,054
0 votes, 0 answers

can we append into existing file using sync() or syncFs() methods?

I made my own jar in Hadoop and also my own shell script; the shell script runs my jar. My Hadoop jar deletes the previous directory if it already exists (because Hadoop doesn't support APPEND) and creates a new one, but the problem is that because my service is running…
Sarde • 658
0 votes, 2 answers

Running Streaming job in hadoop using Java Apis

I am new to hadoop and learning about streaming jobs. Can anybody guide me regarding how to run Streaming Jobs through Java code? Thanks in Advance.
Ajn • 573
0 votes, 1 answer

AWS Elastic MapReduce Streaming. Use data from nested folders as input

I have data located in structure s3n://bucket/{date}/{file}.gz with > 100 folders. How to setup streaming job and use all of them as input? Specifying s3n://bucket/ didn't help since nodes are folders.
varela • 1,281
0 votes, 1 answer

Hadoop - basic + streaming guidance required

I have written a few MapReduce programs in Apache Hadoop 0.2.x versions - in simple words, I'm a beginner. I am attempting to process a large (over 10GB) SegY file on a Linux machine using a software called SeismicUnix. The basic commands that I…
Kaliyug Antagonist • 3,512
0 votes, 4 answers

Data Node not started

I configured Hadoop settings on my box and worked with example programs; everything went fine and worked well, and all the daemons were in the running state. The next morning, the data node was not running.