Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the only requirement is that the program reads from standard input and writes to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so any language that can read standard input and write to standard output can be used to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
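A minimal Python sketch of the max-temperature mapper and reducer invoked above (the fixed-width field offsets follow the NCDC record format used in *Hadoop: The Definitive Guide*; the function names are illustrative, not the book's exact code):

```python
import sys

def map_line(line):
    """Parse one fixed-width NCDC weather record and return a
    tab-separated (year, temperature) pair, or None for bad records."""
    if len(line) < 93:
        return None
    year = line[15:19]
    temp = line[87:92]      # signed, e.g. "+0022"
    quality = line[92:93]
    if temp != "+9999" and quality in "01459":
        return "%s\t%d" % (year, int(temp))
    return None

def reduce_pairs(pairs):
    """Given (year, temp) pairs, return the maximum temperature per year."""
    maxima = {}
    for year, temp in pairs:
        if year not in maxima or temp > maxima[year]:
            maxima[year] = temp
    return sorted(maxima.items())

if __name__ == "__main__":
    # Run as a streaming mapper: records arrive on stdin, pairs go to stdout.
    for record in sys.stdin:
        out = map_line(record)
        if out is not None:
            print(out)
```

Locally the whole job can be approximated with `cat sample.txt | ./max_temperature_map.py | sort | ./max_temperature_reduce.py`, since the shuffle is essentially a sort between the two programs.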
871 questions
0 votes, 1 answer

What's wrong with the following MapReduce code in C?

The format of each line is: date\ttime\tstore name\titem description\tcost\tmethod of payment. We want elements 2 (store name) and 4 (cost) and need to write them out to standard output, separated by a tab. I am looking to get the total sales per…
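A sketch of the same extraction in Python rather than C (the tab-separated field layout comes from the question; everything else, including the script name, is illustrative):

```python
import sys

def map_sale(line):
    """Extract store name (field 2) and cost (field 4) from a record laid
    out as: date \t time \t store \t item \t cost \t payment method."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 6:
        return "%s\t%s" % (fields[2], fields[4])
    return None  # skip malformed records rather than crashing the task

if __name__ == "__main__":
    for line in sys.stdin:
        out = map_sale(line)
        if out is not None:
            print(out)
```

Summing per store then happens in the reducer, which receives these pairs grouped by store name after the shuffle.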
0 votes, 1 answer

Multiple mapper and reducer files in streaming job

I've built a mapper and a reducer in Ruby and it runs successfully as a streaming job. However, I need to do a second map and reduce based on output of the last reduce. Is there any way I can define multiple Ruby files for mappers and reducers in my…
tolgap
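There is no single flag that chains stages; the usual pattern is to run one streaming job per map/reduce step and point each job's -input at the previous job's -output. A hedged Python driver sketch (the jar path, file names, and directory-naming scheme are assumptions, not from the question):

```python
import subprocess

def streaming_cmd(jar, mapper, reducer, input_path, output_path):
    """Build the command line for one streaming job."""
    return ["hadoop", "jar", jar,
            "-input", input_path,
            "-output", output_path,
            "-mapper", mapper,
            "-reducer", reducer]

def run_pipeline(jar, stages, first_input):
    """Run streaming jobs in sequence, feeding each job's -output
    directory to the next job's -input."""
    current = first_input
    for i, (mapper, reducer) in enumerate(stages):
        out = "%s_stage%d" % (first_input.rstrip("/"), i)
        subprocess.check_call(streaming_cmd(jar, mapper, reducer, current, out))
        current = out
    return current
```

For example, `run_pipeline("hadoop-streaming.jar", [("map1.rb", "reduce1.rb"), ("map2.rb", "reduce2.rb")], "raw/")` would run the two stages back to back.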
0 votes, 1 answer

R script log output is not appearing in the Oozie task logs

I am using the R script below as a mapper in Hadoop streaming. I want to see log output (info, warn, etc.) on the TaskTracker console, or wherever else Oozie writes task logs, but for some reason it is not appearing. My Oozie job completes successfully. Script: #!…
Karn_way
0 votes, 1 answer

Hadoop: Reading only the "English" pages

I am trying to read the "English" web pages from Common Crawl. I am running these Hadoop jobs through the Amazon interface. Please have a look at the following code; that is the mapper part. I have no reducer. #!/usr/bin/php
Dongle
0 votes, 1 answer

Python client on OS X streaming to a Hadoop sandbox

I would like to write MapReduce code, ideally using Python, on my Apple Mac and stream it to a Hadoop sandbox (e.g. Hortonworks or Cloudera). Ideally my development setup uses my Mac's Python environment and a Hadoop VM sandbox (later a…
Enzo
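Before involving a sandbox VM at all, a streaming job can be approximated entirely on the Mac, because the shuffle is just a sort between the mapper and the reducer (the classic `cat input | mapper | sort | reducer` trick). A pure-Python harness sketch of that idea (the function names are illustrative):

```python
def simulate_streaming(map_fn, reduce_fn, lines):
    """Approximate `mapper | sort | reducer` in-process: map_fn mimics one
    mapper call per input line, the sort stands in for Hadoop's shuffle
    (which delivers records to reducers sorted by key), and reduce_fn sees
    the sorted stream exactly as a streaming reducer would."""
    emitted = []
    for line in lines:
        record = map_fn(line)
        if record is not None:
            emitted.append(record)
    return reduce_fn(sorted(emitted))
```

Once the logic works locally, the same two scripts can be submitted unchanged with -mapper and -reducer on the sandbox.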
0 votes, 1 answer

Hadoop: single node vs cluster performance

I am running three MapReduce jobs in sequence (output of one is the input to another) on a Hadoop cluster with 3 nodes (1 master and 2 slaves). Apparently, the total time taken by individual jobs to finish on a single node cluster is less than the…
user765675
0 votes, 2 answers

Does giving two jobs the same name cause problems?

I am trying to run two jobs which have the same name. I set the names of the jobs to be the same by initializing mapreduce.job.name. Does this cause any problems?
user34790
0 votes, 2 answers

Reducer not completing and getting stuck at 99%

I am having some issues with running a MapReduce job. The mapper completes quickly. However, the reducer gets stuck at 99.33%. I can see some IO errors in the log; however, isn't Hadoop itself supposed to handle IO errors? I ran the job twice…
user34790
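One frequent cause of a reducer stalling near 100% is a single skewed key combined with the framework killing tasks that report no progress. Hadoop Streaming treats stderr lines of the form `reporter:status:...` and `reporter:counter:group,name,amount` as progress updates, so a long-running reducer can heartbeat explicitly. A sketch (the message text and counter names are illustrative):

```python
import sys

def status_line(message):
    """Format a Hadoop Streaming status heartbeat; writing it to stderr
    tells the framework the task is still making progress."""
    return "reporter:status:%s" % message

def counter_line(group, name, amount=1):
    """Format a Hadoop Streaming counter update for stderr."""
    return "reporter:counter:%s,%s,%d" % (group, name, amount)

if __name__ == "__main__":
    # Example reducer loop that heartbeats every 10,000 input lines.
    for i, line in enumerate(sys.stdin):
        if i % 10000 == 0:
            sys.stderr.write(status_line("processed %d lines" % i) + "\n")
        # ... actual reduce logic would go here ...
```

If the stall is caused by genuinely failing disks rather than skew, the IO errors in the task logs are the place to start instead.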
0 votes, 2 answers

Hadoop Streaming Python Multiple Input Files Single Mapper

I have a single mapper. for line in sys.stdin: #if line is from file1 #process it based on some_arbitrary_logic #emit k,v #if line is from file2 #process it based on another_arbitrary_logic #emit k, v And I need to call…
ComputerFellow
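Streaming exposes job configuration to the script as environment variables with dots replaced by underscores; the path of the current input split arrives as `map_input_file` (older releases) or `mapreduce_map_input_file` (newer ones). A sketch of branching on it — the file1/file2 logic here is a placeholder standing in for the question's arbitrary per-file processing:

```python
import os
import sys

def current_input_file(environ=os.environ):
    """Return the path of the input split being processed, checking both
    the old and the new streaming property names."""
    return environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "")

def map_line(line, input_file):
    """Placeholder per-file branching; real logic would differ."""
    if "file1" in input_file:
        return "file1\t%s" % line.strip()
    return "other\t%s" % line.strip()

if __name__ == "__main__":
    path = current_input_file()
    for line in sys.stdin:
        print(map_line(line, path))
```

Both input files are passed with repeated -input flags; the mapper then decides per split which branch to take.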
0 votes, 1 answer

Hadoop streaming permission issues

I need help debugging a permission issue during Hadoop streaming. I am trying to start an awk streaming job: // mkdir on all nodes [pocal@oscbda01 ~]$ for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 ; do ssh -f oscbda$i mkdir -p…
0 votes, 1 answer

Hadoop: Sorting by first two keys numerically?

I am looking for Hadoop (using Streaming and Python) to sort the outputs of the mapper by the first two keys. My mapper prints as follows: print '%s\t%s\t%s' % (num1, num2, value). I want my reducers to receive this data sorted by num1 and then num2,…
Mo.
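The standard fix is KeyFieldBasedComparator with numeric sort options, e.g. `-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n'` together with `-D stream.num.map.output.key.fields=2` (property names vary slightly across releases). A dependency-free alternative is to zero-pad the numeric key fields in the mapper, so the default byte-wise shuffle sort happens to order them numerically. A sketch of that workaround:

```python
def pad_key(num1, num2, value, width=10):
    """Zero-pad two numeric key fields so the default lexicographic shuffle
    sort orders them numerically. Works for non-negative integers with at
    most `width` digits; negatives would need an offset or sign trick."""
    return "%0*d\t%0*d\t%s" % (width, int(num1), width, int(num2), value)
```

The reducer can strip the padding back off with int() when it parses each key.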
0 votes, 2 answers

Skipping bad input files in hadoop

I'm using Amazon Elastic MapReduce to process some log files uploaded to S3. The log files are uploaded daily from servers using S3, but it seems that some get corrupted during the transfer. This results in a java.io.IOException: IO error in map…
Adrian Mester
0 votes, 1 answer

Chaining Hadoop streaming MapReduce jobs with binary data

I'm having trouble figuring out how to use the binary output of one Hadoop streaming MapReduce job as the input to another Hadoop streaming job. echo.py: import sys while True: buffer = sys.stdin.read(1024) if not buffer: break …
Igor Gatis
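Streaming's default protocol is line-oriented text with tab-separated keys and values, so raw bytes containing newlines or tabs get mangled between jobs. Besides the typed-bytes support (`-io typedbytes`), one common workaround is to base64-encode payloads so every record is a single safe line. A sketch (the record layout is illustrative):

```python
import base64

def encode_record(key, payload):
    """Wrap arbitrary bytes in base64 so they survive streaming's
    line-oriented, tab-delimited text protocol between jobs."""
    return "%s\t%s" % (key, base64.b64encode(payload).decode("ascii"))

def decode_record(line):
    """Invert encode_record in the next job's mapper."""
    key, b64 = line.rstrip("\n").split("\t", 1)
    return key, base64.b64decode(b64)
```

The cost is roughly a 33% size increase on the wire, which typed bytes or a binary-aware format like SequenceFile avoids.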
0 votes, 1 answer

Not able to run MapReduce using Luigi

I am new to map-reduce jobs. These may be basic questions, but the existing documentation didn't help me. How do I run MapReduce jobs using Luigi? For example, with wordcount_hadoop.py, what parameters do I need to pass to start a job? python…
user2695817
0 votes, 2 answers

Hadoop Install R

I have a Hadoop cluster and am thinking about writing my own mapper and reducer in R, then using Hadoop Streaming to do some time-series analysis. However, I am wondering what the 'common' way is to install any kind of software across the…
B.Mr.W.