Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script need only read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
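
Whatever the language, a streaming mapper or reducer is just a program that reads records from standard input and writes key/value pairs (tab-separated by default) to standard output. As a rough sketch of that contract, here is a minimal word-count pair in Python (hypothetical scripts, not the max-temperature examples above):

#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py: sum the counts for each word; the framework sorts the
# mapper output by key, so all lines for one word arrive contiguously
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))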
871 questions
8 votes, 1 answer

How to read a Hadoop sequence file?

I have a sequence file which is the output of a Hadoop map-reduce job. In this file, data is written as key-value pairs, and the value itself is a map. I want to read the value as a Map object so that I can process it further. Configuration config =…
samarth
8 votes, 1 answer

How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce

According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is: min (yarn.nodemanager.resource.memory-mb /…
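
The excerpt's formula is cut off; assuming it is the one from the linked Cloudera post, the number of concurrent containers per node is the minimum of the memory ratio and the vcore ratio. A quick back-of-the-envelope check in Python (the property values below are invented for illustration):

# Hypothetical node configuration; substitute real cluster values.
yarn_nodemanager_resource_memory_mb = 8192
mapreduce_map_memory_mb = 1024
yarn_nodemanager_resource_cpu_vcores = 8
mapreduce_map_cpu_vcores = 1

# Concurrent map tasks per node = min(memory ratio, vcore ratio).
concurrent_maps = min(
    yarn_nodemanager_resource_memory_mb // mapreduce_map_memory_mb,
    yarn_nodemanager_resource_cpu_vcores // mapreduce_map_cpu_vcores,
)
print(concurrent_maps)  # -> 8 with the numbers above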
8 votes, 1 answer

Using python efficiently to calculate hamming distances

I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) that I can use. How do I do this efficiently? My pseudocode would be: for each string: currentstring = string; for each string…
schoon
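
For 64-bit hex hashes like the one in the question above, the Hamming distance between two hashes can be computed by XOR-ing their integer values and counting the set bits; a minimal sketch (not pHash's own function):

def hamming(h1, h2):
    # XOR leaves a 1 bit wherever the two hashes differ.
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

print(hamming("50358c591cef4d76", "50358c591cef4d77"))  # -> 1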
8 votes, 2 answers

Hadoop: job runs okay on smaller set of data but fails with large dataset

I have the following situation: a 3-machine cluster with the following configuration. Master: Usage of /: 91.4% of 74.41GB, MemTotal: 16557308 kB, MemFree: 723736 kB. Slave 01: Usage of /: 52.9% of 29.76GB, MemTotal: …
daydreamer
7 votes, 1 answer

hadoop 2.4.0 streaming generic parser options using TAB as separator

I know that tab is the default field separator: stream.map.output.field.separator, stream.reduce.input.field.separator, stream.reduce.output.field.separator, mapreduce.textoutputformat.separator. But if I try to write the generic parser…
annunarcist
7 votes, 3 answers

Processing images using hadoop

I'm new to Hadoop and I'm going to develop an application which processes multiple images using Hadoop and shows users the results live, while the computation is in progress. The basic approach is to distribute the executable and a bunch of images and gather…
remdezx
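
A common streaming pattern for jobs like the one above is to make each input record an image path (or HDFS URI) and have the mapper shell out to the processing executable, emitting one result line per image. A rough sketch, where process_image is a hypothetical executable shipped to the nodes (for example with the streaming -files option):

#!/usr/bin/env python3
# mapper.py: one image path per input line; run an external tool on it
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # './process_image' stands in for the real image-processing binary.
    result = subprocess.run(["./process_image", path],
                            capture_output=True, text=True)
    print(path + "\t" + result.stdout.strip())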
7 votes, 2 answers

How to use a file in a hadoop streaming job using python?

I want to read a list from a file in my hadoop streaming job. Here is my simple mapper.py: #!/usr/bin/env python import sys import json def read_file(): id_list = [] #read ids from a file f = open('../user_ids','r') for line in f: …
Elham
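
A pattern that usually works here is to ship the side file with the job (for example with the streaming -files option) so that it is placed in each task's working directory, and then open it by basename instead of a relative parent path like '../user_ids'. A sketch of such a mapper (the tab-delimited record layout is an assumption):

#!/usr/bin/env python3
# mapper.py: load a shipped side file, then filter stdin against it.
# Assumes the job was started with '-files user_ids', which puts the
# file in the task's current working directory.
import sys

with open("user_ids") as f:
    id_set = {line.strip() for line in f}

for line in sys.stdin:
    user_id = line.split("\t", 1)[0]
    if user_id in id_set:
        sys.stdout.write(line)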
7 votes, 2 answers

What is the difference between Rack-local map tasks and Data-local map tasks?

When I run "hadoop job -status xxx", the output includes the following: Rack-local map tasks=124, Data-local map tasks=6. What is the difference between Rack-local map tasks and Data-local map tasks?
Sam
7 votes, 1 answer

Python Hadoop streaming: Setting a job name

I have a job that runs on my cluster using hadoop-streaming. I have to start a new job to which I want to give a job name; how can I pass that option, on the command line or in a file, to set a job name? In Java, you can do this by saying JobConf conf =…
daydreamer
7 votes, 2 answers

How to get the name of input file in MRjob

I'm writing a map function using mrjob. My input will come from files in a directory on HDFS. The names of the files contain a small but crucial piece of information that is not present in the files. Is there a way to learn (inside a map function) the…
Bolo
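
When running on Hadoop, streaming exposes job configuration properties to each task as environment variables with dots replaced by underscores, so the input file name can usually be recovered inside the mapper from the environment. A sketch with mrjob (the property is map.input.file on Hadoop 1 and mapreduce.map.input.file on Hadoop 2; the variables are only set when the job actually runs on Hadoop, not with the local inline runner):

import os
from mrjob.job import MRJob

class FileNameJob(MRJob):
    def mapper(self, _, line):
        # Dots become underscores in the exported variable names.
        name = (os.environ.get("mapreduce_map_input_file")
                or os.environ.get("map_input_file", "unknown"))
        yield name, 1

if __name__ == "__main__":
    FileNameJob.run()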
7 votes, 2 answers

Pass directories not files to hadoop-streaming?

In my job, I have the need to parse many historical logsets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For…
Jon Lasser
6 votes, 1 answer

Can I force my reducers (copy phase) to start only when all mappers are completed

I have a Hadoop job with a pretty long map phase and I want other short jobs to run with priority. For this I set the priority of my long job with hadoop job -set-priority job_id LOW. The problem is that, for my long job, the copy phase of the…
user1151446
6 votes, 1 answer

Hadoop Throws ClassCastException for the keytype of java.nio.ByteBuffer

I am using "hadoop-0.20.203.0rc1.tar.gz" for my cluster setup. Whenever I set job.setMapOutputKeyClass(ByteBuffer.class); and run the job, I get the following exception: 12/01/13 15:09:00 INFO mapred.JobClient: Task Id :…
samarth
6 votes, 1 answer

Python Hadoop streaming on windows, Script not a valid Win32 application

I have a problem executing MapReduce Python files on Hadoop using the Hadoop streaming jar. I use: Windows 10 64-bit, Python 3.6 (my IDE is Spyder 3.2.6), Hadoop 2.3.0, jdk1.8.0_161. I can get an answer when my MapReduce code is written in Java…
Mahsa Hassankashi
6 votes, 8 answers

hadoop, python, subprocess failed with code 127

I'm trying to run a very simple task with MapReduce. mapper.py: #!/usr/bin/env python import sys for line in sys.stdin: print line my txt file: qwerty asdfgh zxc Command line to run the job: hadoop jar…
Headmaster