Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the only requirement is that the program reads from standard input and writes to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so any language that can read standard input and write to standard output can be used to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
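A minimal Python sketch of the max-temperature mapper and reducer invoked above (the fixed-width field offsets follow the NCDC record format used in *Hadoop: The Definitive Guide*; the function names are illustrative, not the book's exact code):

```python
import sys

def map_line(line):
    """Parse one fixed-width NCDC weather record and return a
    tab-separated (year, temperature) pair, or None for bad records."""
    if len(line) < 93:
        return None
    year = line[15:19]
    temp = line[87:92]      # signed, e.g. "+0022"
    quality = line[92:93]
    if temp != "+9999" and quality in "01459":
        return "%s\t%d" % (year, int(temp))
    return None

def reduce_pairs(pairs):
    """Given (year, temp) pairs, return the maximum temperature per year."""
    maxima = {}
    for year, temp in pairs:
        if year not in maxima or temp > maxima[year]:
            maxima[year] = temp
    return sorted(maxima.items())

if __name__ == "__main__":
    # Run as a streaming mapper: records arrive on stdin, pairs go to stdout.
    for record in sys.stdin:
        out = map_line(record)
        if out is not None:
            print(out)
```

Locally the whole job can be approximated with `cat sample.txt | ./max_temperature_map.py | sort | ./max_temperature_reduce.py`, since the shuffle is essentially a sort between the two programs.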
871 questions
0 votes, 1 answer

What's wrong with the following MapReduce code in C?

The format of each line is: date\ttime\tstore name\titem description\tcost\tmethod of payment. We want elements 2 (store name) and 4 (cost) and need to write them out to standard output, separated by a tab. I am looking to get the total sales per…
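A sketch of the same extraction in Python rather than C (the tab-separated field layout comes from the question; everything else, including the script name, is illustrative):

```python
import sys

def map_sale(line):
    """Extract store name (field 2) and cost (field 4) from a record laid
    out as: date \t time \t store \t item \t cost \t payment method."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 6:
        return "%s\t%s" % (fields[2], fields[4])
    return None  # skip malformed records rather than crashing the task

if __name__ == "__main__":
    for line in sys.stdin:
        out = map_sale(line)
        if out is not None:
            print(out)
```

Summing per store then happens in the reducer, which receives these pairs grouped by store name after the shuffle.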
0 votes, 1 answer

Multiple mapper and reducer files in streaming job

I've built a mapper and a reducer in Ruby and it runs successfully as a streaming job. However, I need to do a second map and reduce based on output of the last reduce. Is there any way I can define multiple Ruby files for mappers and reducers in my…
tolgap
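There is no single flag that chains stages; the usual pattern is to run one streaming job per map/reduce step and point each job's -input at the previous job's -output. A hedged Python driver sketch (the jar path, file names, and directory-naming scheme are assumptions, not from the question):

```python
import subprocess

def streaming_cmd(jar, mapper, reducer, input_path, output_path):
    """Build the command line for one streaming job."""
    return ["hadoop", "jar", jar,
            "-input", input_path,
            "-output", output_path,
            "-mapper", mapper,
            "-reducer", reducer]

def run_pipeline(jar, stages, first_input):
    """Run streaming jobs in sequence, feeding each job's -output
    directory to the next job's -input."""
    current = first_input
    for i, (mapper, reducer) in enumerate(stages):
        out = "%s_stage%d" % (first_input.rstrip("/"), i)
        subprocess.check_call(streaming_cmd(jar, mapper, reducer, current, out))
        current = out
    return current
```

For example, `run_pipeline("hadoop-streaming.jar", [("map1.rb", "reduce1.rb"), ("map2.rb", "reduce2.rb")], "raw/")` would run the two stages back to back.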
0 votes, 1 answer

R script log output is not appearing in the Oozie task logs

I am using the R script below as a mapper in Hadoop streaming. I want to see log output (info, warn, etc.) on the TaskTracker console, or wherever else Oozie writes task logs, but for some reason it is not appearing. My Oozie job completes successfully. Script: #!…
Karn_way
0 votes, 1 answer

Hadoop: Reading only the "English" pages

I am trying to read the "English" web pages from Common Crawl. I am running these Hadoop jobs through the Amazon interface. Please have a look at the following code; that is the mapper part. I have no reducer. #!/usr/bin/php
Dongle
0 votes, 1 answer

Python client on OS X streaming to a Hadoop sandbox

I would like to write MapReduce code, ideally using Python, on my Apple Mac and stream it to a Hadoop sandbox (e.g. Hortonworks or Cloudera). Ideally my development setup uses my Mac's Python environment and a Hadoop VM sandbox (later a…
Enzo
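Before involving a sandbox VM at all, a streaming job can be approximated entirely on the Mac, because the shuffle is just a sort between the mapper and the reducer (the classic `cat input | mapper | sort | reducer` trick). A pure-Python harness sketch of that idea (the function names are illustrative):

```python
def simulate_streaming(map_fn, reduce_fn, lines):
    """Approximate `mapper | sort | reducer` in-process: map_fn mimics one
    mapper call per input line, the sort stands in for Hadoop's shuffle
    (which delivers records to reducers sorted by key), and reduce_fn sees
    the sorted stream exactly as a streaming reducer would."""
    emitted = []
    for line in lines:
        record = map_fn(line)
        if record is not None:
            emitted.append(record)
    return reduce_fn(sorted(emitted))
```

Once the logic works locally, the same two scripts can be submitted unchanged with -mapper and -reducer on the sandbox.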
0 votes, 1 answer

Hadoop: single node vs cluster performance

I am running three MapReduce jobs in sequence (output of one is the input to another) on a Hadoop cluster with 3 nodes (1 master and 2 slaves). Apparently, the total time taken by individual jobs to finish on a single node cluster is less than the…
user765675
0 votes, 2 answers

Does giving two jobs the same name cause problems?

I am trying to run two jobs which have the same name. I set the names of the jobs to be the same by initializing mapreduce.job.name. Does this cause any problems?
user34790
0 votes, 2 answers

Reducer not completing and getting stuck at 99%

I am having some issues with running a MapReduce job. The mapper completes quickly. However, the reducer gets stuck at 99.33%. I can see some IO errors in the log; however, isn't Hadoop itself supposed to handle IO errors? I ran the job twice…
user34790
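One frequent cause of a reducer stalling near 100% is a single skewed key combined with the framework killing tasks that report no progress. Hadoop Streaming treats stderr lines of the form `reporter:status:...` and `reporter:counter:group,name,amount` as progress updates, so a long-running reducer can heartbeat explicitly. A sketch (the message text and counter names are illustrative):

```python
import sys

def status_line(message):
    """Format a Hadoop Streaming status heartbeat; writing it to stderr
    tells the framework the task is still making progress."""
    return "reporter:status:%s" % message

def counter_line(group, name, amount=1):
    """Format a Hadoop Streaming counter update for stderr."""
    return "reporter:counter:%s,%s,%d" % (group, name, amount)

if __name__ == "__main__":
    # Example reducer loop that heartbeats every 10,000 input lines.
    for i, line in enumerate(sys.stdin):
        if i % 10000 == 0:
            sys.stderr.write(status_line("processed %d lines" % i) + "\n")
        # ... actual reduce logic would go here ...
```

If the stall is caused by genuinely failing disks rather than skew, the IO errors in the task logs are the place to start instead.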
0 votes, 2 answers

Hadoop Streaming Python Multiple Input Files Single Mapper

I have a single mapper. for line in sys.stdin: #if line is from file1 #process it based on some_arbitrary_logic #emit k,v #if line is from file2 #process it based on another_arbitrary_logic #emit k, v And I need to call…
ComputerFellow
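Streaming exposes job configuration to the script as environment variables with dots replaced by underscores; the path of the current input split arrives as `map_input_file` (older releases) or `mapreduce_map_input_file` (newer ones). A sketch of branching on it — the file1/file2 logic here is a placeholder standing in for the question's arbitrary per-file processing:

```python
import os
import sys

def current_input_file(environ=os.environ):
    """Return the path of the input split being processed, checking both
    the old and the new streaming property names."""
    return environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "")

def map_line(line, input_file):
    """Placeholder per-file branching; real logic would differ."""
    if "file1" in input_file:
        return "file1\t%s" % line.strip()
    return "other\t%s" % line.strip()

if __name__ == "__main__":
    path = current_input_file()
    for line in sys.stdin:
        print(map_line(line, path))
```

Both input files are passed with repeated -input flags; the mapper then decides per split which branch to take.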
0 votes, 1 answer

Hadoop streaming permission issues

I need help debugging a permission issue during Hadoop streaming. I am trying to start an awk streaming job: // mkdir on all nodes [pocal@oscbda01 ~]$ for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 ; do ssh -f oscbda$i mkdir -p…
0 votes, 1 answer

Hadoop: Sorting by first two keys numerically?

I am looking for Hadoop (using Streaming and Python) to sort the outputs of the mapper by the first two keys. My mapper prints as follows: print '%s\t%s\t%s' % (num1, num2, value). I want my reducers to receive this data sorted by num1 and then num2,…
Mo.
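The standard fix is KeyFieldBasedComparator with numeric sort options, e.g. `-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n'` together with `-D stream.num.map.output.key.fields=2` (property names vary slightly across releases). A dependency-free alternative is to zero-pad the numeric key fields in the mapper, so the default byte-wise shuffle sort happens to order them numerically. A sketch of that workaround:

```python
def pad_key(num1, num2, value, width=10):
    """Zero-pad two numeric key fields so the default lexicographic shuffle
    sort orders them numerically. Works for non-negative integers with at
    most `width` digits; negatives would need an offset or sign trick."""
    return "%0*d\t%0*d\t%s" % (width, int(num1), width, int(num2), value)
```

The reducer can strip the padding back off with int() when it parses each key.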
0 votes, 2 answers

Skipping bad input files in hadoop

I'm using Amazon Elastic MapReduce to process some log files uploaded to S3. The log files are uploaded daily from servers using S3, but it seems that some get corrupted during the transfer. This results in a java.io.IOException: IO error in map…
Adrian Mester
0 votes, 1 answer

Chaining Hadoop streaming MapReduce jobs with binary data

I'm having trouble figuring out how to use the binary output of one Hadoop streaming MapReduce job as the input to another Hadoop streaming job. echo.py: import sys while True: buffer = sys.stdin.read(1024) if not buffer: break …
Igor Gatis
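Streaming's default protocol is line-oriented text with tab-separated keys and values, so raw bytes containing newlines or tabs get mangled between jobs. Besides the typed-bytes support (`-io typedbytes`), one common workaround is to base64-encode payloads so every record is a single safe line. A sketch (the record layout is illustrative):

```python
import base64

def encode_record(key, payload):
    """Wrap arbitrary bytes in base64 so they survive streaming's
    line-oriented, tab-delimited text protocol between jobs."""
    return "%s\t%s" % (key, base64.b64encode(payload).decode("ascii"))

def decode_record(line):
    """Invert encode_record in the next job's mapper."""
    key, b64 = line.rstrip("\n").split("\t", 1)
    return key, base64.b64decode(b64)
```

The cost is roughly a 33% size increase on the wire, which typed bytes or a binary-aware format like SequenceFile avoids.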
0 votes, 1 answer

Not able to run MapReduce using Luigi

I am new to map-reduce jobs. These may be basic questions, but the existing documentation didn't help me. How do I run MapReduce jobs using Luigi? For example, with wordcount_hadoop.py, what parameters do I need to pass to start a job? python…
user2695817
0 votes, 2 answers

Hadoop Install R

I have a Hadoop cluster and am thinking about writing my own mapper and reducer in R, then using Hadoop Streaming to do some time-series analysis. However, I am wondering what the 'common' way is to install any kind of software across the…
B.Mr.W.