Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer, provided that the executable or script can read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
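
To make the standard-input/standard-output contract more concrete, here is a minimal word-count sketch in Python. The function names `map_lines` and `reduce_lines` are hypothetical, and the records use the tab-separated key/value convention (by default, streaming treats the text up to the first tab of a line as the key). In a real streaming script you would iterate over `sys.stdin` and `print` each record; the functions below take any iterable of lines so they can be tried locally.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper sketch: emit one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer sketch: sum the counts for each key.

    Assumes the input is already sorted by key, which is how mapper
    output reaches a reducer after the shuffle/sort phase.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"
```

Locally, `reduce_lines(sorted(map_lines(lines)))` roughly imitates the map, sort, and reduce steps on a small list of lines; it is only an approximation of what the framework does across many tasks.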

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
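
The scripts named in these commands are not reproduced here. As a rough sketch of what a maximum-temperature mapper and reducer might do, the following assumes a simplified, hypothetical input of `year,temperature` lines rather than the real NCDC record format; the function names are likewise made up for illustration.

```python
def map_max_temp(lines):
    """Mapper sketch: turn 'year,temperature' lines (an assumed format)
    into 'year<TAB>temperature' records, skipping lines that do not parse."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        year, temp = parts
        try:
            value = int(temp)
        except ValueError:
            continue
        yield f"{year}\t{value}"


def reduce_max_temp(lines):
    """Reducer sketch: keep the maximum temperature seen for each year.

    Assumes records with the same key arrive next to each other,
    as they do after the shuffle/sort phase.
    """
    current_year, current_max = None, None
    for line in lines:
        year, temp = line.rstrip("\n").split("\t", 1)
        value = int(temp)
        if year != current_year:
            # A new key starts: flush the result for the previous one.
            if current_year is not None:
                yield f"{current_year}\t{current_max}"
            current_year, current_max = year, value
        else:
            current_max = max(current_max, value)
    if current_year is not None:
        yield f"{current_year}\t{current_max}"
```

Sorting the mapper output before passing it to the reducer, as in `reduce_max_temp(sorted(map_max_temp(lines)))`, is a convenient local stand-in for the grouping that Hadoop performs between the two phases.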

871 questions

1 answer

Sending exact binary sequences using Hadoop streaming

There are sets of binary files that I need to split (according to some logic) and distribute to mappers. I use Hadoop streaming for this. The main problem is to send the exact binary chunks over the wire without altering them. It turned out that…

hadoop hadoop-streaming

asked Oct 08 '13 at 19:03

Y.H.

0 answers

Hadoop Streaming Hangs at Output: /path../output

Hi, I wrote two Python scripts as the mapper and reducer for Hadoop Streaming. I ran the code and it finished the mapping and reducing, both at 100%, but then it just hung at the end of the process. The output looks like…

python hadoop hadoop-streaming

asked Oct 07 '13 at 21:48

B.Mr.W.

1 answer

Error in running Python script in Hadoop-Streaming application

I am running the Hadoop sample application given in 'Hadoop in Action' by Chuck Lam on a Windows 7 notebook in a Cygwin environment. Python is installed on Cygwin and the sample Python application runs. When I run the Hadoop streaming application it is throwing…

python hadoop-streaming

asked Oct 07 '13 at 16:22

Shailesh

1 answer

How do i input an array to a Map Reduce Job?

I have a service that is continuously retrieving some data. I am dumping this data into an array, and this data has to be processed further. Is it possible to create a dynamic array that keeps getting updated by the service, and side by side I can execute…

hadoop mapreduce hadoop-streaming

asked Sep 26 '13 at 06:25

David

2 answers

Log file analysis in Hadoop/MapReduce

Hi, I have some query log files of the following form:
  q_string q_visits q_date
0 red ballons 1790 2012-10-02 00:00:00
1 blue socks 364 2012-10-02 00:00:00
2 current 280 2012-10-02 00:00:00
3 molecular …

hadoop mapreduce hadoop-streaming elastic-map-reduce

asked Sep 21 '13 at 12:41

user7289

1 answer

Add Gem to Distributed Cache in Hive

I have a ruby script that I want to use with Hive streaming. This script requires the use of an external gem. Because this gem is not installed on my data nodes, the script will not run. I would prefer to be able to add this gem on a temporary…

ruby hive hadoop-streaming

asked Sep 10 '13 at 23:38

DJElbow

1 answer

Can I run Hadoop streaming applications without setting up HDFS?

Can I run a Hadoop streaming application without setting up HDFS? I'd like to test a Hadoop streaming application on my local machine. In particular I'm trying to follow the instructions for this tutorial but, instead of specifying a path on the…

hadoop hdfs hadoop-streaming

asked Sep 09 '13 at 19:00

MRocklin

1 answer

How to pass external jar through the commandline while running MapReduce?

I use nltk within a Python MapReduce program and use the command below to execute it. I have found that I am not able to pass nltk correctly along with the command. Could anyone let me know what the correct syntax is? Thanks.

hadoop-streaming

asked Sep 05 '13 at 07:02

user2699073

1 answer

Unable to find files in hadoop streaming

I have a similar issue to "Hadoop Streaming - Unable to find file error". However, none of the solutions presented there are working. My command line is: hadoop jar /mnt/shared/hadoop-streaming-1.0.3.jar -input /user/cloudera/mz_paf/batch_sk=1234…

hadoop-streaming

asked Aug 24 '13 at 02:07

WestCoastProjects

2 answers

how to give descent sort without using any sort command parameter

Now I want to sort in descending order without using any sort command parameter. So I thought of one way: multiply every value by -1, so the max becomes the min and the min becomes the max. And then, because the sort command sorts by the first value, if no command is added…

python shell hadoop hadoop-streaming

asked Aug 23 '13 at 09:06

liumilan

2 answers

how to extract the key from the log in python

I wrote Python code to extract the key from the log. Using the same log, it worked well on one machine, but when I ran it in Hadoop it failed. I guess there are some bugs when using regex. Can anyone give me some comments? Is it that regex can't…

python hadoop hadoop-streaming

asked Aug 23 '13 at 04:44

liumilan

2 answers

Threading with Hadoop Streaming

I am making use of Hadoop streaming to write a python based HTML grabber. I find that running a single threaded python script is slow. I want to modify it to a multithreaded version. Does anyone know what would be a good number to set the number of…

python multithreading hadoop hadoop-streaming amazon-emr

asked Aug 06 '13 at 17:53

viper

1 answer

Take the output of "select" in hive as the input of Hadoop jar input file

I am experimenting with a machine learning package called vowpal wabbit. To run vowpal wabbit on our hadoop cluster, it recommends to do: hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \ …

hadoop jar hive hadoop-streaming

asked Aug 05 '13 at 20:52

Heidi Qiao

1 answer

how partition works on the data from mapper to reducer?

I'm using Hadoop streaming to process a huge file. Say I have a file in which each line is a number, and I want to split this file into 2 files, one containing odd numbers, the other even. Using Hadoop, I might specify 2 reducers for this job, because when the…

hadoop-streaming partition

asked Aug 05 '13 at 03:39

Alcott

2 answers

XML File Input Map/Reduce Hadoop Windows Server

I am working on Hadoop Platform (by HortonWorks) installed on Windows Server and coding Map/Reduce files in C#. I have an input folder with 100k xml files. I want to read each xml file and write each tag in one row. Please follow below…

xml hadoop mapreduce windows-server-2008 hadoop-streaming

asked Aug 02 '13 at 19:05

Varun Gupta
