Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example, the following job uses /bin/cat as the mapper (an identity mapper) and /bin/wc as the reducer:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
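The mapper and reducer are just programs that read lines from standard input and write tab-separated key/value lines to standard output. As a minimal sketch of that protocol (a generic word count for illustration, not the max_temperature scripts referenced above):

#!/usr/bin/env python
# wordcount_map.py - streaming mapper: emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# wordcount_reduce.py - streaming reducer: input arrives sorted by key,
# so all counts for a word are adjacent and can be summed in one pass.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print("%s\t%d" % (word, sum(int(count) for _, count in group)))

These run with the same hadoop jar invocation as above, passed via -mapper, -reducer, and -file.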
871 questions
0 votes, 1 answer

Hadoop: Modify output file after it's written

Summary: can I specify some action to be executed on each output file after it's written with hadoop streaming? Basically, this is a follow-up to the question Easiest efficient way to zip output of hadoop mapreduce. I want for each key X its value written…
modular • 1,099 • 9 • 22
0 votes, 2 answers

ASCII representation of compressed data without certain characters

I want to process a large amount of pickled data with Hadoop using Python. What I am trying to do is represent my data as some key (file id) and the compressed pickle as the value in a large file. If I simply try to put the binary code as ASCII in the file…
twowo • 621 • 1 • 8 • 15
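A common workaround for this (an illustrative sketch, not taken from the question's answers) is to Base64-encode the compressed pickle, so the value can never contain the tab or newline bytes that streaming uses as record delimiters:

#!/usr/bin/env python
# Sketch: make a pickled, compressed value safe for line-oriented,
# tab-separated streaming records by Base64-encoding it.
import base64
import pickle
import zlib

def encode_value(obj):
    # Base64 output uses only [A-Za-z0-9+/=], so no tabs or newlines.
    return base64.b64encode(zlib.compress(pickle.dumps(obj)))

def decode_value(data):
    return pickle.loads(zlib.decompress(base64.b64decode(data)))

record = encode_value({"file_id": 42, "payload": "..."})
assert decode_value(record) == {"file_id": 42, "payload": "..."}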
0 votes, 2 answers

Accessing Raw Data for Hadoop

I am looking at the data.seattle.gov data sets and I'm wondering in general how all of this large raw data can get sent to Hadoop clusters. I am using Hadoop on Azure.
Russell Asher • 185 • 1 • 13
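For the general question of loading large local files into a cluster, the stock tool is hadoop fs -put (the paths below are placeholders; how this maps onto Hadoop on Azure specifically may differ):

hadoop fs -put /local/path/to/dataset.csv /user/hadoop/input/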
0 votes, 1 answer

Hadoop data split and data flow control

I have 2 questions about Hadoop as a storage system. I have a Hadoop cluster of 3 data nodes and I want to direct the splits of a huge file, say of size 128 MB (assuming a split size of 64 MB), to data nodes of my choice. That is, how to control which…
0 votes, 3 answers

Hadoop Streaming with RVM does not find Gem

Original question (long version below). Short version: running hadoop streaming with a Ruby script as mapper, with RVM installed on all cluster nodes, does not work, because ruby is not known (and RVM is not loaded correctly) by the hadoop-launched…
Nicolas • 755 • 9 • 22
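A workaround often suggested for this class of problem is to bypass RVM's shell integration entirely and point -mapper at an RVM-generated wrapper binary, so the hadoop-launched shell does not need RVM loaded. The wrapper path and script name below are illustrative assumptions, not from the question:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
-input input -output output \
-mapper "$HOME/.rvm/wrappers/ruby-1.9.3-p448/ruby max_temperature_map.rb" \
-reducer /bin/cat \
-file max_temperature_map.rb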
0 votes, 1 answer

Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. when running Lucene search on Hadoop

I use each of the records in a big text file to perform a search on Lucene's index, then massage the results as I want and write the output. I'm trying to use Hadoop by putting the big input text file and a pre-created Lucene index onto Hadoop's file…
trillions • 3,669 • 10 • 40 • 59
0 votes, 1 answer

Apache Hadoop 2.0.0 alpha version installation in a full cluster using federation

I had installed the stable Hadoop version successfully, but I am confused while installing hadoop-2.0.0. I want to install hadoop-2.0.0-alpha on two nodes, using federation on both machines. rsi-1 and rsi-2 are the hostnames. What should be the values of the below…
0 votes, 1 answer

Shell script not found in hadoop

I am new to hadoop and hadoop streaming, so this error is probably something obvious that I'm missing. I run an inline awk mapper command and it works fine: hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input input -output output…
PokerIncome.com • 1,708 • 2 • 19 • 30
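A frequent cause of this error (a guess from the excerpt, not the accepted answer) is that the script exists only on the client machine; streaming's -file option packs it into the job submission so the task nodes can run it (mapper.sh is a placeholder name):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar \
-input input -output output \
-mapper mapper.sh \
-file mapper.sh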
0 votes, 1 answer

mongo-hadoop streaming mapper.py not found

I get the following error when running mongo-hadoop streaming: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at…
jassinm • 7,323 • 3 • 33 • 42
0 votes, 1 answer

Building a Hadoop Job object for Hadoop Streaming

I am trying to configure and run a Hadoop Streaming job from Java (the system I'm working with wants Hadoop jobs to be callable from a Java method). I did find the createJob method in org.apache.hadoop.streaming.StreamJob…
Zach • 1,263 • 11 • 25
0 votes, 2 answers

Split file during writing

Gurus! For a long time I couldn't find the answer to the following question: how does Hadoop split a big file during writing? Example: 1) block size is 64 MB; 2) file size is 128 MB (a flat file containing text). When I write the file, will it be split into 2 parts (file size…
Mijatovic • 229 • 1 • 3 • 7
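HDFS splits a file into fixed-size blocks as it is written, without regard to record boundaries, so a 128 MB file with a 64 MB block size ends up as exactly two blocks. The resulting blocks and their locations can be inspected with fsck (the path is a placeholder):

hadoop fsck /path/to/file -files -blocks -locations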
0 votes, 1 answer

job history log file

I have a program which uses the hadoop vaidya tool: http://hadoop.apache.org/mapreduce/docs/r0.21.0/vaidya.html $HADOOP_HOME/contrib/vaidya/bin/vaidya.sh -jobconfig -joblog I am not able to find the job history; where can I find the job…
cldo • 1,735 • 6 • 21 • 26
0 votes, 1 answer

Hadoop: strange ClassNotFoundException

I am getting a ClassNotFoundException. The class which is claimed to be not found does not exist, but the class name is set to the path to the list of input files for my map reduce jobs. INFO server Running: /usr/lib/hadoop/bin/hadoop --config…
Bob • 991 • 8 • 23 • 40
0 votes, 1 answer

How do I pass the Hadoop Streaming -file flag to Amazon ElasticMapReduce?

The -file flag allows you to pack your executable files as part of the job submission, and thus lets you run a MapReduce job without first manually copying the executable to S3. Is there a way to use the -file flag with Amazon's elastic-mapreduce…
tibbe • 8,809 • 7 • 36 • 64
0 votes, 2 answers

Hadoop streaming: getting the optimal number of slots

I have a streaming map-reduce job. I have some 30 slots for processing. Initially I get a single input file containing 60 records (fields are tab-separated); the first field of every record is a number. For the first record the number (first field) is 1, for…
sunillp • 983 • 3 • 13 • 31
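One standard way to spread a small number of records evenly over mappers (an educated guess at what is being asked, not a quoted answer) is NLineInputFormat, which feeds each map task a fixed number of input lines; 60 records at 2 lines per map yields 30 mappers (mapper.py is a placeholder name):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
-D mapred.line.input.format.linespermap=2 \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-input input -output output \
-mapper mapper.py \
-file mapper.py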