Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read standard input and write to standard output.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
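Neither pair of scripts is reproduced here, but a minimal Python mapper/reducer illustrating the streaming contract might look like the sketch below. The year<TAB>temperature input format is assumed purely for illustration (it is not the real NCDC record layout); the point is only that the mapper reads raw lines from standard input and both scripts emit tab-separated key/value pairs on standard output.

max_temperature_map.py (sketch):

#!/usr/bin/env python
# Assumed input: "year<TAB>temperature" per line (illustrative only).
# Streaming delivers raw lines on stdin; we emit "key<TAB>value" on stdout.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 2:
        print("%s\t%s" % (fields[0], fields[1]))

max_temperature_reduce.py (sketch):

#!/usr/bin/env python
# Reducer input arrives sorted by key, so all temperatures for a year are
# contiguous; keep a running maximum and flush it on each key change.
import sys

current_year, max_temp = None, None
for line in sys.stdin:
    year, temp = line.rstrip("\n").split("\t")
    temp = int(temp)
    if year != current_year:
        if current_year is not None:
            print("%s\t%d" % (current_year, max_temp))
        current_year, max_temp = year, temp
    elif temp > max_temp:
        max_temp = temp
if current_year is not None:
    print("%s\t%d" % (current_year, max_temp))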
871 questions
0 votes, 3 answers

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at path 'hdfs:///logs'. Each log entry is multiple lines but has a starting and an ending marker to demarcate between two entries. Not all entries in a log file are useful; the entries which are useful…
Deepak Garg • 366 • 3 • 12
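A common streaming pattern for multi-line records like these is a mapper that buffers lines between the markers and emits only the entries that pass a usefulness test. The sketch below uses hypothetical BEGIN/END markers and a placeholder predicate; the real ones would come from the actual log format. Note also that the default input format can split a file in the middle of an entry, which is part of what makes this question non-trivial.

#!/usr/bin/env python
# Hypothetical mapper for multi-line log entries. "BEGIN"/"END" and
# is_useful() are placeholders for the question's actual format.
import sys

def is_useful(entry):
    return "ERROR" in entry  # placeholder predicate

buf, inside = [], False
for line in sys.stdin:
    line = line.rstrip("\n")
    if line == "BEGIN":
        buf, inside = [], True
    elif line == "END" and inside:
        entry = " ".join(buf)
        if is_useful(entry):
            print(entry)
        inside = False
    elif inside:
        buf.append(line)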
0 votes, 1 answer

"The location specified by MRJOB_CONF" in mrjob documentation

Which path is "The location specified by MRJOB_CONF" in the mrjob documentation? Link to the mrjob docs: http://mrjob.readthedocs.org/en/latest/guides/configs-basics.html
user1403483
0 votes, 1 answer

StreamInputFormat for mapreduce job

I have an application that connects to a remote system and transfers data from it using the SFTP protocol. I want to use a MapReduce job to do the same. I would need an input format that reads from an input stream. I have been going through the docs for…
RadAl • 404 • 5 • 23
0 votes, 1 answer

Efficient Hadoop Word counting for large file

I want to implement a Hadoop reducer for word counting. In my reducer I use a hash table to count the words. But if my file is extremely large, the hash table will use an extreme amount of memory. How can I address this issue? (E.g. a file with 10 million…
nikosdi • 2,138 • 5 • 26 • 35
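The usual answer to this memory concern is that a streaming reducer does not need a hash table at all: Hadoop sorts the mapper output by key before the reduce phase, so identical words arrive consecutively and a single running counter suffices. A minimal sketch:

#!/usr/bin/env python
# Constant-memory word-count reducer: because input is sorted by key,
# only the counter for the current word must be held in memory.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(n)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, int(n)
if current_word is not None:
    print("%s\t%d" % (current_word, count))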
0 votes, 2 answers

Use Hadoop Streaming to run binary via script

I am new to Hadoop and I am trying to figure out a way to do the following: I have multiple input image files. I have binary executables that process these files. These binary executables write text files as output. I have a folder that contains…
AlexIIP • 2,461 • 5 • 29 • 44
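One common pattern here is to ship the binary to each node with the -file option and have the mapper act as a thin wrapper that invokes it per input record. A sketch with hypothetical names (./process_image stands in for the actual binary):

#!/usr/bin/env python
# Hypothetical wrapper mapper: treats each input line as an image path
# and runs a shipped binary on it, forwarding the binary's text output.
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    result = subprocess.run(["./process_image", path],
                            capture_output=True, text=True, check=False)
    sys.stdout.write(result.stdout)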
0 votes, 1 answer

Hadoop streaming with single mapper

I am using Hadoop streaming, and I start the script as follows: ../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \ -mapper ../tests/mapper.php \ -reducer ../tests/reducer.php \ -input data …
Nick • 9,962 • 4 • 42 • 80
0 votes, 1 answer

reducer just won't start in hadoop streaming

I am not sure what's happening, but I wrote a simple mapper and reducer script, and I am testing it against a small dataset (a few lines long). For some reason the reducer is just not starting… and the mapper is executing again and again? 12/11/20…
frazman • 32,081 • 75 • 184 • 269
0 votes, 1 answer

map reduce to read a file from ftp

We have an application that downloads files from an FTP server. We are planning to improve its efficiency by using MapReduce to download the files from FTP. My first question is, is it actually possible to improve efficiency using MapReduce? What we…
RadAl • 404 • 5 • 23
0 votes, 2 answers

debugging hadoop streaming program

I have data in the form id, movieid, date, time: 3710100, 13502, 2012-09-10, 12:39:38.000. Basically, what I want to do is find out how many times a particular movie is watched between 7 am and 11 am at 30 minute…
frazman • 32,081 • 75 • 184 • 269
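One way to attack this kind of query (a sketch, with the field layout taken from the excerpt) is a mapper that keys each record by movie and 30-minute bucket, followed by a simple sum reducer:

#!/usr/bin/env python
# Sketch: emit "movieid:bucket<TAB>1" for views between 07:00 and 11:00.
# Assumed comma-separated fields: id, movieid, date, time.
import sys

for line in sys.stdin:
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4:
        continue
    _, movieid, _, t = parts
    hour, minute = int(t[0:2]), int(t[3:5])
    if 7 <= hour < 11:
        bucket = hour * 2 + (1 if minute >= 30 else 0)  # 30-minute slot
        print("%s:%d\t1" % (movieid, bucket))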
0 votes, 1 answer

running pig script with udf on hadoop

I'm new to hadoop and pig. I wonder how to run a pig script that internally calls a UDF method? The thing is, I don't see the statement "register blah.jar" mentioned like on the Pig UDF Manual site: register myudfs.jar; A = load 'student_data' as (name:…
trillions • 3,669 • 10 • 40 • 59
0 votes, 2 answers

Error in Hadoop installation

I am trying to install Hadoop on a Fedora machine by following the guide here. I installed Java (and verified that it exists with java -version). I have ssh installed (since it is Linux). I downloaded the latest version, Hadoop 1.0.4, from here. I have…
Shiva Krishna Bavandla • 25,548 • 75 • 193 • 313
0 votes, 1 answer

finding the smallest number hadoop streaming python

I am new to the hadoop framework and the map reduce abstraction. Basically, I thought of finding the smallest number in a huge text file (delimited by ","). So, here is my code, mapper.py: #!/usr/bin/env python import sys # input comes from STDIN…
frazman • 32,081 • 75 • 184 • 269
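For a global minimum, the usual streaming trick is to emit a single constant key so every value reaches one reducer, which keeps only a running minimum. A sketch under the question's assumption of comma-delimited numbers:

mapper.py (sketch):

#!/usr/bin/env python
# Constant key routes all numbers to a single reducer.
import sys

for line in sys.stdin:
    for tok in line.strip().split(","):
        if tok:
            print("min\t%s" % tok)

reducer.py (sketch):

#!/usr/bin/env python
# Stream the values, keeping only the smallest one seen so far.
import sys

smallest = None
for line in sys.stdin:
    _, value = line.rstrip("\n").split("\t", 1)
    value = float(value)
    if smallest is None or value < smallest:
        smallest = value
if smallest is not None:
    print(smallest)

Emitting one local minimum per mapper (or adding a combiner) would cut shuffle traffic considerably.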
0 votes, 1 answer

merging two files in hadoop

I am a newbie to the hadoop framework, so it would help me if someone could guide me through this. I have two types of files: dirA/ --> file_a, file_b, file_c; dirB/ --> another_file_a, another_file_b... Files in directory A contain transaction…
frazman • 32,081 • 75 • 184 • 269
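Questions like this usually end up as a reduce-side join: tag each record with its source so the reducer can tell the two file types apart when they meet under a shared key. A sketch, assuming (hypothetically) that both file types carry the join key in their first comma-separated column; streaming exposes the input file name as the map_input_file environment variable:

#!/usr/bin/env python
# Reduce-side join mapper sketch: tag each record with its source dir.
import os
import sys

source = "A" if "dirA" in os.environ.get("map_input_file", "") else "B"
for line in sys.stdin:
    key, _, rest = line.rstrip("\n").partition(",")
    print("%s\t%s,%s" % (key, source, rest))

The reducer then receives all records for a key together and can pair the A-records with the B-records.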
0 votes, 1 answer

hadoop streaming with python modules

I've seen a technique (on stackoverflow) for executing a hadoop streaming job using zip files to store referenced python modules. I'm having some errors during the mapping phase of my job's execution. I'm fairly certain it's related to the zip'd…
ct_ • 1,189 • 4 • 20 • 34
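For reference, Python can import directly from a zip archive once it is on sys.path, so a mapper shipped alongside a modules zip (via -file) can do something like this sketch (mymodules.zip and mymodule are hypothetical names):

#!/usr/bin/env python
# Make modules inside a shipped zip importable; -file mymodules.zip
# places the archive in the task's working directory.
import sys

sys.path.insert(0, "mymodules.zip")  # zipimport handles the rest
import mymodule  # hypothetical module stored inside the zip

for line in sys.stdin:
    print(mymodule.transform(line.rstrip("\n")))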
0 votes, 1 answer

hadoop streaming get node id

In hadoop streaming, is there a way to get the ID of a node handling a given task? By way of analogy, this snippet gives the name of the input file for the task: #!/usr/bin/env python import os map_input_file = str(os.environ["map_input_file"]) I'm…
Abe • 22,738 • 26 • 82 • 111
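In the same spirit as that snippet, streaming tasks also see job configuration values such as mapred.task.id as environment variables (dots become underscores), and the node itself can be identified by hostname. A sketch:

#!/usr/bin/env python
# Tag each record with the host and task that processed it.
import os
import socket
import sys

task_id = os.environ.get("mapred_task_id", "unknown")
host = socket.gethostname()
for line in sys.stdin:
    print("%s\t%s\t%s" % (host, task_id, line.rstrip("\n")))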