Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script need only read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
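
Whatever the language, a streaming mapper or reducer is just a program that reads records from standard input and writes key/value pairs (tab-separated by default) to standard output. As a rough sketch of that contract, here is a minimal word-count pair in Python (hypothetical scripts, not the max-temperature examples above):

#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py: sum the counts for each word; the framework sorts the
# mapper output by key, so all lines for one word arrive contiguously
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))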
871 questions
8 votes, 1 answer

How to read a Hadoop sequence file?

I have a sequence file which is the output of a Hadoop map-reduce job. In this file, data is written as key-value pairs, and the value itself is a map. I want to read the value as a Map object so that I can process it further. Configuration config =…
samarth
8 votes, 1 answer

How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce

According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is: min (yarn.nodemanager.resource.memory-mb /…
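
The excerpt's formula is cut off; assuming it is the one from the linked Cloudera post, the number of concurrent containers per node is the minimum of the memory ratio and the vcore ratio. A quick back-of-the-envelope check in Python (the property values below are invented for illustration):

# Hypothetical node configuration; substitute real cluster values.
yarn_nodemanager_resource_memory_mb = 8192
mapreduce_map_memory_mb = 1024
yarn_nodemanager_resource_cpu_vcores = 8
mapreduce_map_cpu_vcores = 1

# Concurrent map tasks per node = min(memory ratio, vcore ratio).
concurrent_maps = min(
    yarn_nodemanager_resource_memory_mb // mapreduce_map_memory_mb,
    yarn_nodemanager_resource_cpu_vcores // mapreduce_map_cpu_vcores,
)
print(concurrent_maps)  # -> 8 with the numbers above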
8 votes, 1 answer

Using python efficiently to calculate hamming distances

I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) that I can use. How do I do this efficiently? My pseudocode would be: for each string: currentstring = string; for each string…
schoon
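
For 64-bit hex hashes like the one in the question above, the Hamming distance between two hashes can be computed by XOR-ing their integer values and counting the set bits; a minimal sketch (not pHash's own function):

def hamming(h1, h2):
    # XOR leaves a 1 bit wherever the two hashes differ.
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

print(hamming("50358c591cef4d76", "50358c591cef4d77"))  # -> 1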
8 votes, 2 answers

Hadoop: job runs okay on smaller set of data but fails with large dataset

I have the following situation: a 3-machine cluster with the following configuration. Master: Usage of /: 91.4% of 74.41GB, MemTotal: 16557308 kB, MemFree: 723736 kB. Slave 01: Usage of /: 52.9% of 29.76GB, MemTotal: …
daydreamer
7 votes, 1 answer

hadoop 2.4.0 streaming generic parser options using TAB as separator

I know that tab is the default field separator: stream.map.output.field.separator, stream.reduce.input.field.separator, stream.reduce.output.field.separator, mapreduce.textoutputformat.separator. But if I try to write the generic parser…
annunarcist
7 votes, 3 answers

Processing images using hadoop

I'm new to Hadoop and I'm going to develop an application which processes multiple images using Hadoop and shows users the results live, while the computation is in progress. The basic approach is to distribute the executable and a bunch of images and gather…
remdezx
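
A common streaming pattern for jobs like the one above is to make each input record an image path (or HDFS URI) and have the mapper shell out to the processing executable, emitting one result line per image. A rough sketch, where process_image is a hypothetical executable shipped to the nodes (for example with the streaming -files option):

#!/usr/bin/env python3
# mapper.py: one image path per input line; run an external tool on it
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # './process_image' stands in for the real image-processing binary.
    result = subprocess.run(["./process_image", path],
                            capture_output=True, text=True)
    print(path + "\t" + result.stdout.strip())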
7 votes, 2 answers

How to use a file in a hadoop streaming job using python?

I want to read a list from a file in my hadoop streaming job. Here is my simple mapper.py: #!/usr/bin/env python import sys import json def read_file(): id_list = [] #read ids from a file f = open('../user_ids','r') for line in f: …
Elham
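
A pattern that usually works here is to ship the side file with the job (for example with the streaming -files option) so that it is placed in each task's working directory, and then open it by basename instead of a relative parent path like '../user_ids'. A sketch of such a mapper (the tab-delimited record layout is an assumption):

#!/usr/bin/env python3
# mapper.py: load a shipped side file, then filter stdin against it.
# Assumes the job was started with '-files user_ids', which puts the
# file in the task's current working directory.
import sys

with open("user_ids") as f:
    id_set = {line.strip() for line in f}

for line in sys.stdin:
    user_id = line.split("\t", 1)[0]
    if user_id in id_set:
        sys.stdout.write(line)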
7 votes, 2 answers

What is the difference between Rack-local map tasks and Data-local map tasks?

When I run "hadoop job -status xxx", the output includes the following: Rack-local map tasks=124, Data-local map tasks=6. What is the difference between Rack-local map tasks and Data-local map tasks?
Sam
7 votes, 1 answer

Python Hadoop streaming: Setting a job name

I have a job that runs on my cluster using hadoop-streaming. I have to start a new job to which I want to give a job name; how can I pass that option, on the command line or in a file, to set a job name? In Java, you can do this by saying JobConf conf =…
daydreamer
7 votes, 2 answers

How to get the name of input file in MRjob

I'm writing a map function using mrjob. My input will come from files in a directory on HDFS. The names of the files contain a small but crucial piece of information that is not present in the files. Is there a way to learn (inside a map function) the…
Bolo
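
When running on Hadoop, streaming exposes job configuration properties to each task as environment variables with dots replaced by underscores, so the input file name can usually be recovered inside the mapper from the environment. A sketch with mrjob (the property is map.input.file on Hadoop 1 and mapreduce.map.input.file on Hadoop 2; the variables are only set when the job actually runs on Hadoop, not with the local inline runner):

import os
from mrjob.job import MRJob

class FileNameJob(MRJob):
    def mapper(self, _, line):
        # Dots become underscores in the exported variable names.
        name = (os.environ.get("mapreduce_map_input_file")
                or os.environ.get("map_input_file", "unknown"))
        yield name, 1

if __name__ == "__main__":
    FileNameJob.run()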
7 votes, 2 answers

Pass directories not files to hadoop-streaming?

In my job, I have the need to parse many historical logsets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For…
Jon Lasser
6 votes, 1 answer

Can I force my reducers (copy phase) to start only when all mappers are completed

I have a Hadoop job with a pretty long map phase and I want other short jobs to run with priority. For this I set the priority of my long job with hadoop job -set-priority job_id LOW. The problem is that, for my long job, the copy phase of the…
user1151446
6 votes, 1 answer

Hadoop Throws ClassCastException for the keytype of java.nio.ByteBuffer

I am using "hadoop-0.20.203.0rc1.tar.gz" for my cluster setup. Whenever I set job.setMapOutputKeyClass(ByteBuffer.class); and run the job, I get the following exception: 12/01/13 15:09:00 INFO mapred.JobClient: Task Id :…
samarth
6 votes, 1 answer

Python Hadoop streaming on windows, Script not a valid Win32 application

I have a problem executing MapReduce Python files on Hadoop using the Hadoop streaming jar. I use: Windows 10 64-bit, Python 3.6 (my IDE is Spyder 3.2.6), Hadoop 2.3.0, jdk1.8.0_161. I can get an answer when my MapReduce code is written in Java…
Mahsa Hassankashi
6 votes, 8 answers

hadoop, python, subprocess failed with code 127

I'm trying to run a very simple task with MapReduce. mapper.py: #!/usr/bin/env python import sys for line in sys.stdin: print line my txt file: qwerty asdfgh zxc Command line to run the job: hadoop jar…
Headmaster