Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
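
Here /bin/cat acts as an identity mapper, passing each input line through unchanged, and /bin/wc reduces its input to line, word, and byte counts.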

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
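
The mapper and reducer named above are ordinary scripts that read raw lines from standard input and write tab-separated key/value pairs to standard output. A minimal mapper sketch (the comma-separated record layout below is a hypothetical stand-in, not the actual format used by these examples):

#!/usr/bin/env python
# Minimal streaming mapper sketch. Assumes each input line is a
# hypothetical CSV record of the form "station,year,temperature".
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) == 3:
        station, year, temp = fields
        # Emit year as the key and temperature as the value, separated
        # by a tab (streaming's default key/value separator).
        print("%s\t%s" % (year, temp))
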
871 questions
0 votes, 1 answer

OpenCL with Hadoop

How do I use OpenCL (for GPU compute) with Hadoop? My data set resides in HDFS. I need to compute 5 metrics, among which 2 are compute intensive, so I want to compute those 2 metrics on the GPU using OpenCL and the remaining 3 metrics using Java MapReduce…
0 votes, 1 answer

Is the Hadoop file system a physical file system or a virtual file system?
0 votes, 1 answer

How to integrate NLTK with Hadoop HDFS?

I have a working sentiment analysis program using NLTK which reads text from a .txt file on my local machine. Now I would like to read a .txt file stored in Hadoop HDFS and perform the same sentiment analysis. How can I achieve this? Any…
Praveen Gr
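
One common way to approach the question above is to stream the file out of HDFS with the hadoop fs -cat shell command and feed it to the Python process. A rough sketch, assuming a hypothetical HDFS path:

# Rough sketch: read a text file from HDFS by shelling out to the
# hadoop CLI. The HDFS path below is hypothetical.
import subprocess

proc = subprocess.Popen(
    ["hadoop", "fs", "-cat", "/user/hadoop/reviews.txt"],
    stdout=subprocess.PIPE,
)
lines = [raw.decode("utf-8").rstrip("\n") for raw in proc.stdout]
proc.wait()
# `lines` now holds the file contents; feed them to the NLTK pipeline.
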
0 votes, 1 answer

Include map and reduce written in C/OpenCL in Hadoop

I have written my own map and reduce functions as an OpenCL kernel. The usual MapReduce scenario is the one built into Hadoop, which is itself written in Java. How can I use my own map-reduce code written in C/OpenCL in Hadoop on…
sandeep.ganage
0 votes, 1 answer

How to decode a binary file which must be decoded using an external binary in one shot?

I have a large number of input files in a proprietary binary format. I need to turn them into rows for further processing. Each file must be decoded in one shot by an external binary (i.e. files must not be concatenated or split). Options that I'm…
corsair
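
A pattern sometimes used for whole-file jobs like the one above is to make the job input a list of file paths, one per line, so the mapper shells out to the external decoder once per file. A sketch under that assumption (the ./decode binary and its row-per-line output are hypothetical):

#!/usr/bin/env python
# Sketch of a streaming mapper whose input lines are file paths rather
# than data; each file is handed to an external decoder in one shot.
# The ./decode binary and its output format are hypothetical.
import subprocess
import sys

for path in sys.stdin:
    path = path.strip()
    if not path:
        continue
    # Decode the whole file at once; the binary writes rows to stdout.
    rows = subprocess.check_output(["./decode", path])
    sys.stdout.write(rows.decode("utf-8"))
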
0 votes, 1 answer

Splitting responsibilities of mappers on Elastic MapReduce (MySQL + MongoDB input)

I want to make sure I understand EMR correctly. I'm wondering - does what I'm talking about make any sense with EMR / Hadoop? I currently have a recommendation engine on my app that examines data stored in both MySQL and MongoDB (both on separate…
nlyn
0 votes, 1 answer

Adding extra arguments to HadoopJarStepConfig fails

I am trying to run this command via the AWS SDK: hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -input hdfs:///logs/ -output hdfs:///no_dups -mapper dedup_mapper.py -reducer dedup_reducer.py -file deduplication.py dedup_mapper.py…
Shane
0 votes, 1 answer

Is plain text the only legal format for Amazon EMR input data?

As quoted in the "Developer Guide" of Amazon EMR, the files in the input directory should be formatted as plain text. Does that mean I cannot upload binary files or .png files and parse them with a Python script?
kururu
0 votes, 3 answers

How to work on a specific part of a CSV file uploaded into HDFS?

How do I work on a specific part of a CSV file uploaded into HDFS? I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using…
Samy Louize Hanna
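
If the exported CSV marks each row with its source table (a hypothetical layout; the actual export format is not shown above), one streaming approach is a mapper that simply filters on that field:

#!/usr/bin/env python
# Sketch: keep only rows belonging to one table. Assumes a hypothetical
# export layout where the first CSV field names the source table,
# e.g. "customers,42,Jane".
import sys

WANTED_TABLE = "customers"  # hypothetical table name

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if fields and fields[0] == WANTED_TABLE:
        print(",".join(fields[1:]))
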
0 votes, 1 answer

The number of running mappers in a Hadoop job

Using streaming, I set the number of maps to 200, like this: -D mapred.map.tasks=200 -D mapred.job.map.capacity=200. But later I found that the number of running mappers is just 9, with 500+ mapper tasks pending. This looks pretty weird to me, because I…
Alcott
0 votes, 1 answer

Consolidating multiple Hadoop clusters

We have multiple Hadoop clusters using Hive and Pig; what is the best way to consolidate them into one? In BI this was done by building an EDW or taking an MDM approach. What about Hadoop? Is anyone thinking about this?
0 votes, 2 answers

Configuring Hadoop to use a different Reducer process for each key?

Related to my question, I have a streaming process written in Python. I notice that each Reducer gets all the values associated with multiple keys through sys.stdin. I would prefer sys.stdin to contain only the values associated with one key.…
Shane
0 votes, 1 answer

Reducer getting multiple keys through sys.stdin?

I know that all the values associated with a key are sent to a single Reducer. Is it the case that a Reducer could get multiple keys at once via its standard input? My use case is that I am splitting lines into key-value pairs, then I want to send…
Shane
0 votes, 1 answer

mapred.local.dir error in Hadoop streaming

Error: hadoop_admin@ubuntu:~/hadoop$ bin/hadoop jar /home/hadoop_admin/hadoop/contrib/streaming/hadoop-0.20.0-streaming.jar -input data -output DOUT -mapper /home/balachanderp/libsvm-hadoop-master/scripts/mapperLibsvm.py -reducer…
Bala
0 votes, 1 answer

What does PipeMapRed do in Hadoop streaming?

I have run a Hadoop job more than once, and every time it takes too much time to finish, like 15 mins in all. I checked the syslog and found that org.apache.hadoop.streaming.PipeMapRed was doing something for about 10 mins, and after…
Alcott