Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
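
The mapper and reducer named above are ordinary executables that read lines on standard input and write tab-separated key/value pairs on standard output. The book's scripts are not reproduced here; purely as an illustrative sketch (the file names mapper.py and reducer.py are made up, and word count stands in for the temperature example), a minimal pair might look like this:

#!/usr/bin/env python
# mapper.py -- minimal sketch of a streaming mapper:
# emit "word<TAB>1" for every token read on standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)

#!/usr/bin/env python
# reducer.py -- minimal sketch of a streaming reducer. The framework
# sorts mapper output by key, so lines for the same word arrive
# contiguously and a single running total suffices.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, count))
        current_word, count = word, int(value)
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, count))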
871 questions
0 votes, 3 answers

Getting friends within a specified degree with MapReduce

Do you know how I can implement this algorithm using the MapReduce paradigm?

def getFriends(self, degree):
    friendList = []
    self._getFriends(degree, friendList)
    return friendList

def _getFriends(self, degree, friendList): …
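
A common pattern for this kind of reachability query is iterative MapReduce: each pass widens the frontier by one degree, and a driver reruns the job degree times. A hedged sketch of the per-pass mapper follows; the "node<TAB>distance<TAB>neighbors" record layout is an assumption, not something from the question:

#!/usr/bin/env python
# bfs_map.py -- hypothetical one-hop BFS mapper for streaming.
# Assumed record layout: "node<TAB>distance<TAB>comma-separated-neighbors",
# where distance -1 means "not reached yet".
import sys

for line in sys.stdin:
    node, dist, neighbors = line.rstrip("\n").split("\t")
    dist = int(dist)
    # Re-emit the node itself so its adjacency list survives the pass.
    sys.stdout.write("%s\t%d\t%s\n" % (node, dist, neighbors))
    if dist >= 0:
        # Reached nodes propose distance+1 for each neighbor.
        for n in neighbors.split(","):
            if n:
                sys.stdout.write("%s\t%d\t\n" % (n, dist + 1))

The matching reducer would keep, per node, the smallest non-negative distance and the non-empty adjacency list; after the final pass, nodes with 0 <= distance <= degree form the friend list.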
0 votes, 1 answer

DiskErrorException on slave machine - Hadoop multinode

I am trying to process XML files with Hadoop; I got the following error when invoking a word-count job on the XML files. 13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008_0, Status : FAILED Too many fetch-failures 13/07/25…
0 votes, 1 answer

Error starting HDFS daemons on Hadoop multinode cluster

Issue while setting up Hadoop multi-node. As soon as I start my HDFS daemon on the master (bin/start-dfs.sh), I get the logs below on the master: starting namenode, logging to…
Surya • 3,408 • 5 • 27 • 35
0 votes, 0 answers

Hadoop Streaming Job limited to 6 Maps and 6 Reduces

So I'm running a pretty basic (just a search for a simple expression) program via Hadoop streaming on my 3-node cluster. When I run the job, JobTracker informs me that only 6 maps and 6 reduces are running, with 2000 pending map and reduce jobs. Why…
Alpha • 11 • 3
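
For reference, the usual explanation on Hadoop 1.x is concurrent task slots rather than any job-level setting: each TaskTracker defaults to 2 map and 2 reduce slots, so a 3-node cluster runs 3 × 2 = 6 maps at a time no matter how many attempts are pending. A hedged sketch of the per-node mapred-site.xml change (the values are illustrative, and the TaskTrackers must be restarted afterwards):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>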
0 votes, 3 answers

Running an R script using a Hadoop streaming job failing: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

I have an R script which works perfectly fine in the R console, but when I run it in Hadoop streaming it fails with the below error in the map phase. Find the task attempt's log. The Hadoop streaming command I have…
user1281780 • 79 • 1 • 4
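
Exit code 1 from PipeMapRed generally just means the child process died. With interpreted scripts the first things to check are whether the script was shipped to the nodes at all and whether the nodes can execute it; R/Rscript must also be installed on every node. A hedged sketch (file and directory names are placeholders):

# Make the script self-executing and ship it with the job.
# The first line of my_script.R should be:  #!/usr/bin/env Rscript
chmod +x my_script.R

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input in/ \
  -output out/ \
  -mapper my_script.R \
  -file my_script.R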
0 votes, 2 answers

Creating more partitions than reducers

When developing locally on my single machine, I believe the default number of reducers is 6. In a particular MR step, I actually divide up the data into n partitions where n can be greater than 6. From what I have observed, it looks like only 6 of…
syker • 10,912 • 16 • 56 • 68
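
Worth noting: a custom partitioner can compute n partition numbers, but the framework passes the configured reduce-task count to the partitioner, and the default HashPartitioner folds keys into that many buckets; with 6 reducers you therefore see at most 6 output partitions. A hedged example of raising the count per job (paths and script names made up):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.reduce.tasks=12 \
  -input in/ -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py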
0 votes, 1 answer

Hadoop map reduce - access missing data

Let's say I have a client script that pulls a large amount of data from Hadoop. What functionality in Hadoop lets me look at the retrieved data and ask for (point out) a missing part of the data, to make a specific request just to read…
Majoris • 2,963 • 6 • 47 • 81
0 votes, 1 answer

Hadoop streaming error, MapReduce with Python

I'm a newbie to the Hadoop environment. Do you have any idea how to solve this error, or what may be the reason behind it? hduser@intel-HP-Pavilion-g6-Notebook-PC:~/hduser/hadoop$ sudo ./bin/hadoop jar…
deadendtux • 63 • 2 • 8
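
The command is cut off before the actual error, so only generic checks apply: streaming jobs with Python scripts most often fail because the scripts lack a shebang or the executable bit, or were never shipped with -file (needing sudo is itself usually a sign of a permissions problem elsewhere). A hedged checklist, with made-up file names:

head -1 mapper.py             # expect: #!/usr/bin/env python
chmod +x mapper.py reducer.py # nodes must be able to exec the scripts

hadoop jar ./contrib/streaming/hadoop-*-streaming.jar \
  -input in/ -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py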
0 votes, 2 answers

Sampling Records from Hadoop Mapper

I have a dataset whose key consists of 3 parts: a, b and c. In my mapper, I would like to emit records with the key as 'a' and the value as 'a,b,c' How do I emit 10% of the total records for each 'a' that is detected from the mapper in Hadoop?…
syker • 10,912 • 16 • 56 • 68
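
One stateless way to approximate this is per-record random sampling in the mapper, which yields roughly 10% of the records for each 'a' in expectation (an exact 10% would need the per-key totals, i.e. a counting pass first). A hedged sketch, assuming comma-separated "a,b,c" input records:

#!/usr/bin/env python
# sample_map.py -- hypothetical mapper: keep ~10% of records at random,
# keyed on the first component 'a' of an "a,b,c" composite key.
import random
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) >= 3 and random.random() < 0.10:
        a, b, c = fields[0], fields[1], fields[2]
        sys.stdout.write("%s\t%s,%s,%s\n" % (a, a, b, c))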
0 votes, 3 answers

Hadoop input format for Hadoop streaming (WikiHadoop input format)

I wonder whether there are any differences between the InputFormats for Hadoop and Hadoop streaming. Do the input formats for Hadoop streaming also work for Hadoop, and vice versa? I am asking because I found a special input format for the…
user2426139 • 53 • 1 • 4
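
On the underlying point: streaming reuses ordinary Java InputFormats, passed with -inputformat, provided the keys and values come out as text, so a format written for plain Hadoop generally works for streaming too (the reverse direction is the same classes). A hedged sketch using the WikiHadoop class; the class and jar names should be checked against that project:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -libjars wikihadoop.jar \
  -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
  -input dumps/enwiki-latest-pages-articles.xml \
  -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py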
0 votes, 1 answer

I have tesseract-ocr and hadoop separately. I need to integrate them

My project is about image processing. What I need is to integrate Hadoop (parallel processing) with Tesseract (image-to-text OCR).
Mahesh Muni • 51 • 2 • 7
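
Since streaming is line-oriented, a common workaround for binary inputs is to feed the job a text file of image paths and have each mapper shell out to the OCR binary. A hedged sketch (it assumes tesseract is installed on every node, that the paths are readable from the workers, and that the tesseract version is 3.03+ so it can write to stdout):

#!/usr/bin/env python
# ocr_map.py -- hypothetical mapper: each input line names an image the
# worker can read (e.g. on shared storage); run tesseract on it.
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # "tesseract <image> stdout" prints the recognized text (3.03+).
    text = subprocess.check_output(["tesseract", path, "stdout"],
                                   universal_newlines=True)
    sys.stdout.write("%s\t%s\n" % (path, text.replace("\n", " ")))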
0 votes, 1 answer

Run a new Hadoop streaming job from current running Job

Is it possible to create and run a new Hadoop streaming job from either (a) a regular Hadoop Java job that's currently executing, or (b) a Hadoop mapper (in Python) that's executing as part of a Hadoop streaming job? And how?
T. Webster • 9,605 • 6 • 67 • 94
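
Nothing stops a task from shelling out to the hadoop CLI, though the spawned job competes for the same task slots, and a task that blocks waiting on it must stay under mapred.task.timeout. A hedged sketch from a Python mapper; all paths are made up, and the hadoop binary is assumed to be on the workers' PATH:

#!/usr/bin/env python
# launch_map.py -- hypothetical mapper that spawns a follow-up
# streaming job via the hadoop command-line client.
import subprocess

cmd = [
    "hadoop", "jar",
    "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    "-input", "second/in", "-output", "second/out",
    "-mapper", "cat", "-reducer", "wc",
]
subprocess.check_call(cmd)  # blocks; beware the task timeout while waiting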
0 votes, 1 answer

Duplicated tasks get killed

After I submit job to Hadoop cluster, and job input is split between nodes, I can see that some tasks get two attempts running in parallel. E.g. at node 39 task attempt attempt_201305230321_0019_m_000073_0 is started and in 3 minutes…
Roman Bodnarchuk • 29,461 • 12 • 59 • 75
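
What's described is speculative execution: the JobTracker launches a backup attempt for a straggling task and kills whichever attempt finishes second. If that's undesirable for a job, it can be switched off per job (Hadoop 1.x property names; paths and scripts below are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks.speculative.execution=false \
  -input in/ -output out/ \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py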
0 votes, 1 answer

awk doesn't work in hadoop's mapper

This is my hadoop job:

hadoop streaming \
  -D mapred.map.tasks=1 \
  -D mapred.reduce.tasks=1 \
  -mapper "awk '{if(\$0<3)print}'" \ # doesn't work
  -reducer "cat" \
  -input "/user/***/input/" \
  -output "/user/***/out/"

this job always fails, with an error…
Alcott • 17,905 • 32 • 116 • 173
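
Inline awk in -mapper is fragile because the command line passes through another shell on the task nodes, and the escaped \$0 tends to get eaten along the way, leaving awk with a different program than intended. The usual workaround is to move the awk into a shipped script so no quoting has to survive two shells. A hedged sketch, with a made-up script name:

#!/bin/sh
# filter.sh -- wraps the awk one-liner; make it executable (chmod +x).
awk '{ if ($0 < 3) print }'

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.map.tasks=1 \
  -D mapred.reduce.tasks=1 \
  -input in/ -output out/ \
  -mapper filter.sh -reducer cat \
  -file filter.sh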
0 votes, 2 answers

How to process unstructured data through MapReduce

I'm trying to understand unstructured data first. To me, what's mentioned below is unstructured data. I have followed the "Hadoop: The Definitive Guide" earthquake example, and that is structured data with positions defined for location,…
Zelig • 45 • 1 • 3 • 10