Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
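
The mapper and reducer named above are ordinary executables that read lines on standard input and write tab-separated key/value pairs on standard output. The book's scripts are not reproduced here; purely as an illustrative sketch (the file names mapper.py and reducer.py are made up, and word count stands in for the temperature example), a minimal pair might look like this:

#!/usr/bin/env python
# mapper.py -- minimal sketch of a streaming mapper:
# emit "word<TAB>1" for every token read on standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)

#!/usr/bin/env python
# reducer.py -- minimal sketch of a streaming reducer. The framework
# sorts mapper output by key, so lines for the same word arrive
# contiguously and a single running total suffices.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, count))
        current_word, count = word, int(value)
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, count))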
871 questions
0 votes, 3 answers

Getting friends within a specified degree with MapReduce

Do you know how I can implement this algorithm using the MapReduce paradigm?

def getFriends(self, degree):
    friendList = []
    self._getFriends(degree, friendList)
    return friendList

def _getFriends(self, degree, friendList): …
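
A common pattern for this kind of reachability query is iterative MapReduce: each pass widens the frontier by one degree, and a driver reruns the job degree times. A hedged sketch of the per-pass mapper follows; the "node<TAB>distance<TAB>neighbors" record layout is an assumption, not something from the question:

#!/usr/bin/env python
# bfs_map.py -- hypothetical one-hop BFS mapper for streaming.
# Assumed record layout: "node<TAB>distance<TAB>comma-separated-neighbors",
# where distance -1 means "not reached yet".
import sys

for line in sys.stdin:
    node, dist, neighbors = line.rstrip("\n").split("\t")
    dist = int(dist)
    # Re-emit the node itself so its adjacency list survives the pass.
    sys.stdout.write("%s\t%d\t%s\n" % (node, dist, neighbors))
    if dist >= 0:
        # Reached nodes propose distance+1 for each neighbor.
        for n in neighbors.split(","):
            if n:
                sys.stdout.write("%s\t%d\t\n" % (n, dist + 1))

The matching reducer would keep, per node, the smallest non-negative distance and the non-empty adjacency list; after the final pass, nodes with 0 <= distance <= degree form the friend list.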
0 votes, 1 answer

DiskErrorException on slave machine - Hadoop multinode

I am trying to process XML files with Hadoop; I got the following error when invoking a word-count job on the XML files. 13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008_0, Status : FAILED Too many fetch-failures 13/07/25…
0 votes, 1 answer

Error starting HDFS daemons on Hadoop multinode cluster

Issue while setting up Hadoop multi-node. As soon as I start my HDFS daemon on the master (bin/start-dfs.sh), I get the logs below on the master: starting namenode, logging to…
Surya • 3,408 • 5 • 27 • 35
0 votes, 0 answers

Hadoop Streaming Job limited to 6 Maps and 6 Reduces

So I'm running a pretty basic (just a search for a simple expression) program via Hadoop streaming on my 3-node cluster. When I run the job, JobTracker informs me that only 6 maps and 6 reduces are running, with 2000 pending map and reduce jobs. Why…
Alpha • 11 • 3
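
For reference, the usual explanation on Hadoop 1.x is concurrent task slots rather than any job-level setting: each TaskTracker defaults to 2 map and 2 reduce slots, so a 3-node cluster runs 3 × 2 = 6 maps at a time no matter how many attempts are pending. A hedged sketch of the per-node mapred-site.xml change (the values are illustrative, and the TaskTrackers must be restarted afterwards):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>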
0 votes, 3 answers

Running an R script using a Hadoop streaming job failing: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

I have an R script which works perfectly fine in the R console, but when I run it in Hadoop streaming it fails with the below error in the map phase. Find the task attempt's log. The Hadoop streaming command I have…
user1281780 • 79 • 1 • 4
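
Exit code 1 from PipeMapRed generally just means the child process died. With interpreted scripts the first things to check are whether the script was shipped to the nodes at all and whether the nodes can execute it; R/Rscript must also be installed on every node. A hedged sketch (file and directory names are placeholders):

# Make the script self-executing and ship it with the job.
# The first line of my_script.R should be:  #!/usr/bin/env Rscript
chmod +x my_script.R

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input in/ \
  -output out/ \
  -mapper my_script.R \
  -file my_script.R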
0 votes, 2 answers

Creating more partitions than reducers

When developing locally on my single machine, I believe the default number of reducers is 6. In a particular MR step, I actually divide up the data into n partitions where n can be greater than 6. From what I have observed, it looks like only 6 of…
syker • 10,912 • 16 • 56 • 68
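
Worth noting: a custom partitioner can compute n partition numbers, but the framework passes the configured reduce-task count to the partitioner, and the default HashPartitioner folds keys into that many buckets; with 6 reducers you therefore see at most 6 output partitions. A hedged example of raising the count per job (paths and script names made up):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.reduce.tasks=12 \
  -input in/ -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py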
0 votes, 1 answer

Hadoop map reduce - access missing data

Let's say I have a client script that pulls a large amount of data from Hadoop. What functionality in Hadoop lets me look at the retrieved data and ask for (point out) a missing part of the data, to make a specific request just to read…
Majoris • 2,963 • 6 • 47 • 81
0 votes, 1 answer

Hadoop streaming error, MapReduce with Python

I'm a newbie to the Hadoop environment. Do you have any idea how to solve this error, or what may be the reason behind it? hduser@intel-HP-Pavilion-g6-Notebook-PC:~/hduser/hadoop$ sudo ./bin/hadoop jar…
deadendtux • 63 • 2 • 8
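
The command is cut off before the actual error, so only generic checks apply: streaming jobs with Python scripts most often fail because the scripts lack a shebang or the executable bit, or were never shipped with -file (needing sudo is itself usually a sign of a permissions problem elsewhere). A hedged checklist, with made-up file names:

head -1 mapper.py             # expect: #!/usr/bin/env python
chmod +x mapper.py reducer.py # nodes must be able to exec the scripts

hadoop jar ./contrib/streaming/hadoop-*-streaming.jar \
  -input in/ -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py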
0 votes, 2 answers

Sampling Records from Hadoop Mapper

I have a dataset whose key consists of 3 parts: a, b and c. In my mapper, I would like to emit records with the key as 'a' and the value as 'a,b,c' How do I emit 10% of the total records for each 'a' that is detected from the mapper in Hadoop?…
syker • 10,912 • 16 • 56 • 68
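
One stateless way to approximate this is per-record random sampling in the mapper, which yields roughly 10% of the records for each 'a' in expectation (an exact 10% would need the per-key totals, i.e. a counting pass first). A hedged sketch, assuming comma-separated "a,b,c" input records:

#!/usr/bin/env python
# sample_map.py -- hypothetical mapper: keep ~10% of records at random,
# keyed on the first component 'a' of an "a,b,c" composite key.
import random
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) >= 3 and random.random() < 0.10:
        a, b, c = fields[0], fields[1], fields[2]
        sys.stdout.write("%s\t%s,%s,%s\n" % (a, a, b, c))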
0 votes, 3 answers

Hadoop input format for Hadoop streaming (WikiHadoop input format)

I wonder whether there are any differences between the InputFormats for Hadoop and Hadoop streaming. Do the input formats for Hadoop streaming also work for Hadoop, and vice versa? I am asking because I found a special input format for the…
user2426139 • 53 • 1 • 4
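
On the underlying point: streaming reuses ordinary Java InputFormats, passed with -inputformat, provided the keys and values come out as text, so a format written for plain Hadoop generally works for streaming too (the reverse direction is the same classes). A hedged sketch using the WikiHadoop class; the class and jar names should be checked against that project:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -libjars wikihadoop.jar \
  -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
  -input dumps/enwiki-latest-pages-articles.xml \
  -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py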
0 votes, 1 answer

I have tesseract-ocr and hadoop separately. I need to integrate them

My project is about image processing. What I need is to integrate Hadoop (parallel processing) with Tesseract (image-to-text OCR).
Mahesh Muni • 51 • 2 • 7
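
Since streaming is line-oriented, a common workaround for binary inputs is to feed the job a text file of image paths and have each mapper shell out to the OCR binary. A hedged sketch (it assumes tesseract is installed on every node, that the paths are readable from the workers, and that the tesseract version is 3.03+ so it can write to stdout):

#!/usr/bin/env python
# ocr_map.py -- hypothetical mapper: each input line names an image the
# worker can read (e.g. on shared storage); run tesseract on it.
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # "tesseract <image> stdout" prints the recognized text (3.03+).
    text = subprocess.check_output(["tesseract", path, "stdout"],
                                   universal_newlines=True)
    sys.stdout.write("%s\t%s\n" % (path, text.replace("\n", " ")))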
0 votes, 1 answer

Run a new Hadoop streaming job from current running Job

Is it possible to create and run a new Hadoop streaming job from either (a) a regular Hadoop Java job that's currently executing, or (b) a Hadoop mapper (in Python) that's executing as part of a Hadoop streaming job? And how?
T. Webster • 9,605 • 6 • 67 • 94
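
Nothing stops a task from shelling out to the hadoop CLI, though the spawned job competes for the same task slots, and a task that blocks waiting on it must stay under mapred.task.timeout. A hedged sketch from a Python mapper; all paths are made up, and the hadoop binary is assumed to be on the workers' PATH:

#!/usr/bin/env python
# launch_map.py -- hypothetical mapper that spawns a follow-up
# streaming job via the hadoop command-line client.
import subprocess

cmd = [
    "hadoop", "jar",
    "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    "-input", "second/in", "-output", "second/out",
    "-mapper", "cat", "-reducer", "wc",
]
subprocess.check_call(cmd)  # blocks; beware the task timeout while waiting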
0 votes, 1 answer

Duplicated tasks get killed

After I submit job to Hadoop cluster, and job input is split between nodes, I can see that some tasks get two attempts running in parallel. E.g. at node 39 task attempt attempt_201305230321_0019_m_000073_0 is started and in 3 minutes…
Roman Bodnarchuk • 29,461 • 12 • 59 • 75
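
What's described is speculative execution: the JobTracker launches a backup attempt for a straggling task and kills whichever attempt finishes second. If that's undesirable for a job, it can be switched off per job (Hadoop 1.x property names; paths and scripts below are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks.speculative.execution=false \
  -input in/ -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py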
0 votes, 1 answer

awk doesn't work in hadoop's mapper

This is my hadoop job:

hadoop streaming \
  -D mapred.map.tasks=1 \
  -D mapred.reduce.tasks=1 \
  -mapper "awk '{if(\$0<3)print}'" \ # doesn't work
  -reducer "cat" \
  -input "/user/***/input/" \
  -output "/user/***/out/"

this job always fails, with an error…
Alcott • 17,905 • 32 • 116 • 173
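
Inline awk in -mapper is fragile because the command line passes through another shell on the task nodes, and the escaped \$0 tends to get eaten along the way, leaving awk with a different program than intended. The usual workaround is to move the awk into a shipped script so no quoting has to survive two shells. A hedged sketch, with a made-up script name:

#!/bin/sh
# filter.sh -- wraps the awk one-liner; make it executable (chmod +x).
awk '{ if ($0 < 3) print }'

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.map.tasks=1 \
  -D mapred.reduce.tasks=1 \
  -input in/ -output out/ \
  -mapper filter.sh -reducer cat \
  -file filter.sh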
0 votes, 2 answers

How to process unstructured data through MapReduce

I'm trying to understand unstructured data first. To me, what's mentioned below is unstructured data. I have followed the "Hadoop: The Definitive Guide" earthquake example, and that is structured data with positions defined for location,…
Zelig • 45 • 1 • 3 • 10