Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read standard input and write to standard output.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
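Neither pair of scripts is reproduced here, but a minimal Python mapper/reducer illustrating the streaming contract might look like the sketch below. The year<TAB>temperature input format is assumed purely for illustration (it is not the real NCDC record layout); the point is only that the mapper reads raw lines from standard input and both scripts emit tab-separated key/value pairs on standard output.

max_temperature_map.py (sketch):

#!/usr/bin/env python
# Assumed input: "year<TAB>temperature" per line (illustrative only).
# Streaming delivers raw lines on stdin; we emit "key<TAB>value" on stdout.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 2:
        print("%s\t%s" % (fields[0], fields[1]))

max_temperature_reduce.py (sketch):

#!/usr/bin/env python
# Reducer input arrives sorted by key, so all temperatures for a year are
# contiguous; keep a running maximum and flush it on each key change.
import sys

current_year, max_temp = None, None
for line in sys.stdin:
    year, temp = line.rstrip("\n").split("\t")
    temp = int(temp)
    if year != current_year:
        if current_year is not None:
            print("%s\t%d" % (current_year, max_temp))
        current_year, max_temp = year, temp
    elif temp > max_temp:
        max_temp = temp
if current_year is not None:
    print("%s\t%d" % (current_year, max_temp))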
871 questions
0 votes, 3 answers

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at path 'hdfs:///logs'. Each log entry is multiple lines but has a starting and an ending marker to demarcate between two entries. Not all entries in a log file are useful; the entries which are useful…
Deepak Garg • 366 • 3 • 12
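A common streaming pattern for multi-line records like these is a mapper that buffers lines between the markers and emits only the entries that pass a usefulness test. The sketch below uses hypothetical BEGIN/END markers and a placeholder predicate; the real ones would come from the actual log format. Note also that the default input format can split a file in the middle of an entry, which is part of what makes this question non-trivial.

#!/usr/bin/env python
# Hypothetical mapper for multi-line log entries. "BEGIN"/"END" and
# is_useful() are placeholders for the question's actual format.
import sys

def is_useful(entry):
    return "ERROR" in entry  # placeholder predicate

buf, inside = [], False
for line in sys.stdin:
    line = line.rstrip("\n")
    if line == "BEGIN":
        buf, inside = [], True
    elif line == "END" and inside:
        entry = " ".join(buf)
        if is_useful(entry):
            print(entry)
        inside = False
    elif inside:
        buf.append(line)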
0 votes, 1 answer

"The location specified by MRJOB_CONF" in mrjob documentation

Which path is "The location specified by MRJOB_CONF" in the mrjob documentation? Link to the mrjob docs: http://mrjob.readthedocs.org/en/latest/guides/configs-basics.html
user1403483
0 votes, 1 answer

StreamInputFormat for mapreduce job

I have an application that connects to a remote system and transfers data from it using the SFTP protocol. I want to use a MapReduce job to do the same. I would need an input format that reads from an input stream. I have been going through the docs for…
RadAl • 404 • 5 • 23
0 votes, 1 answer

Efficient Hadoop Word counting for large file

I want to implement a Hadoop reducer for word counting. In my reducer I use a hash table to count the words. But if my file is extremely large, the hash table will use an extreme amount of memory. How can I address this issue? (E.g. a file with 10 million…
nikosdi • 2,138 • 5 • 26 • 35
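The usual answer to this memory concern is that a streaming reducer does not need a hash table at all: Hadoop sorts the mapper output by key before the reduce phase, so identical words arrive consecutively and a single running counter suffices. A minimal sketch:

#!/usr/bin/env python
# Constant-memory word-count reducer: because input is sorted by key,
# only the counter for the current word must be held in memory.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(n)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, int(n)
if current_word is not None:
    print("%s\t%d" % (current_word, count))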
0 votes, 2 answers

Use Hadoop Streaming to run binary via script

I am new to Hadoop and I am trying to figure out a way to do the following: I have multiple input image files. I have binary executables that process these files. These binary executables write text files as output. I have a folder that contains…
AlexIIP • 2,461 • 5 • 29 • 44
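One common pattern here is to ship the binary to each node with the -file option and have the mapper act as a thin wrapper that invokes it per input record. A sketch with hypothetical names (./process_image stands in for the actual binary):

#!/usr/bin/env python
# Hypothetical wrapper mapper: treats each input line as an image path
# and runs a shipped binary on it, forwarding the binary's text output.
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    result = subprocess.run(["./process_image", path],
                            capture_output=True, text=True, check=False)
    sys.stdout.write(result.stdout)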
0 votes, 1 answer

Hadoop streaming with single mapper

I am using Hadoop streaming, and I start the script as follows: ../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \ -mapper ../tests/mapper.php \ -reducer ../tests/reducer.php \ -input data …
Nick • 9,962 • 4 • 42 • 80
0 votes, 1 answer

reducer just won't start in hadoop streaming

I am not sure what's happening, but I wrote a simple mapper and reducer script, and I am testing it against a small dataset (a few lines long). For some reason the reducer is just not starting… and the mapper is executing again and again? 12/11/20…
frazman • 32,081 • 75 • 184 • 269
0 votes, 1 answer

map reduce to read a file from ftp

We have an application that downloads files from an FTP server. We are planning to improve its efficiency by using MapReduce to download the files from FTP. My first question is, is it actually possible to improve efficiency using MapReduce? What we…
RadAl • 404 • 5 • 23
0 votes, 2 answers

debugging hadoop streaming program

I have data in the form id, movieid, date, time: 3710100, 13502, 2012-09-10, 12:39:38.000. Basically, what I want to do is find out how many times a particular movie is watched between 7 am and 11 am at 30 minute…
frazman • 32,081 • 75 • 184 • 269
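One way to attack this kind of query (a sketch, with the field layout taken from the excerpt) is a mapper that keys each record by movie and 30-minute bucket, followed by a simple sum reducer:

#!/usr/bin/env python
# Sketch: emit "movieid:bucket<TAB>1" for views between 07:00 and 11:00.
# Assumed comma-separated fields: id, movieid, date, time.
import sys

for line in sys.stdin:
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4:
        continue
    _, movieid, _, t = parts
    hour, minute = int(t[0:2]), int(t[3:5])
    if 7 <= hour < 11:
        bucket = hour * 2 + (1 if minute >= 30 else 0)  # 30-minute slot
        print("%s:%d\t1" % (movieid, bucket))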
0 votes, 1 answer

running pig script with udf on hadoop

I'm new to hadoop and pig. I wonder how to run a pig script that internally calls a UDF method? The thing is, I don't see the statement "register blah.jar" mentioned like on the Pig UDF Manual site: register myudfs.jar; A = load 'student_data' as (name:…
trillions • 3,669 • 10 • 40 • 59
0 votes, 2 answers

Error in Hadoop installation

I am trying to install Hadoop on a Fedora machine by following the guide here. I installed Java (and verified that it exists with java -version). I have ssh installed (since it is Linux). I downloaded the latest version, Hadoop 1.0.4, from here. I have…
Shiva Krishna Bavandla • 25,548 • 75 • 193 • 313
0 votes, 1 answer

finding the smallest number hadoop streaming python

I am new to the hadoop framework and the map reduce abstraction. Basically, I thought of finding the smallest number in a huge text file (delimited by ","). So, here is my code, mapper.py: #!/usr/bin/env python import sys # input comes from STDIN…
frazman • 32,081 • 75 • 184 • 269
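For a global minimum, the usual streaming trick is to emit a single constant key so every value reaches one reducer, which keeps only a running minimum. A sketch under the question's assumption of comma-delimited numbers:

mapper.py (sketch):

#!/usr/bin/env python
# Constant key routes all numbers to a single reducer.
import sys

for line in sys.stdin:
    for tok in line.strip().split(","):
        if tok:
            print("min\t%s" % tok)

reducer.py (sketch):

#!/usr/bin/env python
# Stream the values, keeping only the smallest one seen so far.
import sys

smallest = None
for line in sys.stdin:
    _, value = line.rstrip("\n").split("\t", 1)
    value = float(value)
    if smallest is None or value < smallest:
        smallest = value
if smallest is not None:
    print(smallest)

Emitting one local minimum per mapper (or adding a combiner) would cut shuffle traffic considerably.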
0 votes, 1 answer

merging two files in hadoop

I am a newbie to the hadoop framework, so it would help me if someone could guide me through this. I have two types of files: dirA/ --> file_a, file_b, file_c; dirB/ --> another_file_a, another_file_b... Files in directory A contain transaction…
frazman • 32,081 • 75 • 184 • 269
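Questions like this usually end up as a reduce-side join: tag each record with its source so the reducer can tell the two file types apart when they meet under a shared key. A sketch, assuming (hypothetically) that both file types carry the join key in their first comma-separated column; streaming exposes the input file name as the map_input_file environment variable:

#!/usr/bin/env python
# Reduce-side join mapper sketch: tag each record with its source dir.
import os
import sys

source = "A" if "dirA" in os.environ.get("map_input_file", "") else "B"
for line in sys.stdin:
    key, _, rest = line.rstrip("\n").partition(",")
    print("%s\t%s,%s" % (key, source, rest))

The reducer then receives all records for a key together and can pair the A-records with the B-records.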
0 votes, 1 answer

hadoop streaming with python modules

I've seen a technique (on stackoverflow) for executing a hadoop streaming job using zip files to store referenced python modules. I'm having some errors during the mapping phase of my job's execution. I'm fairly certain it's related to the zip'd…
ct_ • 1,189 • 4 • 20 • 34
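For reference, Python can import directly from a zip archive once it is on sys.path, so a mapper shipped alongside a modules zip (via -file) can do something like this sketch (mymodules.zip and mymodule are hypothetical names):

#!/usr/bin/env python
# Make modules inside a shipped zip importable; -file mymodules.zip
# places the archive in the task's working directory.
import sys

sys.path.insert(0, "mymodules.zip")  # zipimport handles the rest
import mymodule  # hypothetical module stored inside the zip

for line in sys.stdin:
    print(mymodule.transform(line.rstrip("\n")))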
0 votes, 1 answer

hadoop streaming get node id

In hadoop streaming, is there a way to get the ID of a node handling a given task? By way of analogy, this snippet gives the name of the input file for the task: #!/usr/bin/env python import os map_input_file = str(os.environ["map_input_file"]) I'm…
Abe • 22,738 • 26 • 82 • 111
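In the same spirit as that snippet, streaming tasks also see job configuration values such as mapred.task.id as environment variables (dots become underscores), and the node itself can be identified by hostname. A sketch:

#!/usr/bin/env python
# Tag each record with the host and task that processed it.
import os
import socket
import sys

task_id = os.environ.get("mapred_task_id", "unknown")
host = socket.gethostname()
for line in sys.stdin:
    print("%s\t%s\t%s" % (host, task_id, line.rstrip("\n")))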