Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the executable or script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read from standard input and write to standard output.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
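
The mapper and reducer are ordinary executables. As a rough illustration (a word-count pair with hypothetical file names, not the max_temperature scripts above), a streaming mapper turns input lines into tab-separated key/value pairs, and the reducer receives those pairs sorted by key:

wc_mapper.py:

#!/usr/bin/env python3
# Streaming mapper: emit "word<TAB>1" for every word on standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

wc_reducer.py:

#!/usr/bin/env python3
# Streaming reducer: the framework sorts map output by key, so all
# counts for a word arrive adjacently and can be summed in one pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

Because the scripts only touch stdin and stdout, the same pipeline can be tested without Hadoop at all, e.g. cat input.txt | ./wc_mapper.py | sort | ./wc_reducer.py.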
871 questions
4 votes • 4 answers

Unzip files using hadoop streaming

I have many files in HDFS, each of them a zip file with one CSV file inside. I'm trying to uncompress the files so I can run a streaming job on them. I tried: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \ -D mapred.reduce.tasks=0…
Miki Tebeka • 13,428 • 4 • 37 • 49
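
For the unzip question above: ZIP is not one of Hadoop's native compression codecs, so streaming jobs cannot decompress these files transparently. One common workaround, sketched below under the assumption that Info-ZIP's funzip is available on the client and that each archive holds a single CSV (paths are placeholders), is to unzip outside of MapReduce by piping through the HDFS shell:

# Stream the archive out of HDFS, extract its first (and only)
# member with funzip, and write the CSV back without touching
# the local disk.
hadoop fs -cat /data/zipped/file1.zip | funzip | \
    hadoop fs -put - /data/unzipped/file1.csv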
4 votes • 2 answers

Hadoop streaming: how do I set the partitioning?

I'm very new to Hadoop streaming and am having some difficulty with partitioning. Depending on what is found in a line, my mapper function either returns key1, 0, somegeneralvalues # some kind of "header" line where linetype = 0 or key1, 1, value1,…
aherve • 3,795 • 6 • 28 • 41
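
For composite keys like key1 plus a linetype flag, the stock KeyFieldBasedPartitioner can sort on both fields while partitioning on the first one only, so the "header" line (linetype 0) reaches the same reducer as its data lines and sorts ahead of them. A sketch with placeholder script and directory names:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mapper.py \
    -reducer reducer.py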
4 votes • 6 answers

Python code is valid but Hadoop Streaming produces part-00000 "Empty file"

On an Ubuntu virtual machine I have set up a single-node cluster as per Michael Noll's tutorial and this has been my starting point for writing a Hadoop program. Also, for reference, this. My program is in Python and uses Hadoop Streaming. I have…
dafuloth • 43 • 1 • 6
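
An empty part-00000 with valid Python usually points at the plumbing rather than the logic: the scripts were not shipped to the task nodes, lack a shebang line, or are not executable. A checklist-style sketch (file names are placeholders):

# Each script needs a shebang (#!/usr/bin/env python) and the
# executable bit, and -file ships it into every task's working dir.
chmod +x mapper.py reducer.py
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -file mapper.py -mapper mapper.py \
    -file reducer.py -reducer reducer.py \
    -input /user/me/input \
    -output /user/me/output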
3 votes • 2 answers

Can we cascade multiple MapReduce jobs in Hadoop Streaming (lang: Python)?

I am using Python and have to work on the following scenario using Hadoop Streaming: a) Map1->Reduce1->Map2->Reduce2 b) I don't want to store intermediate files c) I don't want to install packages like Cascading, Yelp, Oozie. I have kept them as a last…
Piyush Kansal • 1,201 • 4 • 18 • 26
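
Hadoop streaming has no built-in job chaining, so Map1->Reduce1->Map2->Reduce2 is normally driven from a wrapper script; the intermediate output does have to land in HDFS, but it can be deleted as soon as the second job succeeds. A minimal sketch with placeholder paths and script names:

# Job 1 writes to a scratch directory, job 2 consumes it, and the
# scratch directory is removed once the chain has finished.
INTERMEDIATE=/tmp/job1-out
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input /data/in -output $INTERMEDIATE \
    -file map1.py -mapper map1.py -file reduce1.py -reducer reduce1.py
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input $INTERMEDIATE -output /data/out \
    -file map2.py -mapper map2.py -file reduce2.py -reducer reduce2.py
hadoop fs -rm -r $INTERMEDIATE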
3 votes • 2 answers

Does Hadoop really handle datanode failure?

In our Hadoop setup, when a datanode crashes (or Hadoop doesn't respond on the datanode), the reduce task fails, unable to read from the failed node (exception below). I thought Hadoop handled datanode failures, and that that is the main purpose of creating…
Boolean • 14,266 • 30 • 88 • 129
3 votes • 1 answer

Hadoop cluster - Do I need to replicate my code over all machines before running a job?

This is what confuses me: when I use the wordcount example, I keep the code on the master and let it handle the slaves, and it runs fine. But when I run my own code, it starts to fail on the slaves, giving weird errors like Traceback (most recent call…
daydreamer • 87,243 • 191 • 450 • 722
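
The usual reason the wordcount example "just works" is that its Java classes travel inside the job JAR; a streaming script enjoys no such packaging unless it is shipped explicitly. The -file option (the -files generic option in newer releases) uploads the script to every task's working directory, so nothing needs to be copied to the slaves by hand. A sketch with placeholder names:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input /data/in -output /data/out \
    -file my_mapper.py -mapper my_mapper.py \
    -file my_reducer.py -reducer my_reducer.py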
3 votes • 3 answers

Hadoop streaming: ensuring one key per reducer

I have a mapper that, while processing data, classifies output into 3 different types (type is the output key). My goal is to create 3 different csv files via the reducers, each with all of the data for one key with a header row. The key values can…
underrun • 6,713 • 2 • 41 • 53
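
With exactly three output types, a natural first step is three reducers partitioned on the key, sketched below with placeholder names. One caveat: hash partitioning gives no guarantee that three distinct keys land in three distinct partitions, so a mapper that emits an explicit partition number (0, 1, 2) as the first key field is often the more reliable route.

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=3 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /data/in -output /data/out \
    -file classify.py -mapper classify.py \
    -file to_csv.py -reducer to_csv.py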
3 votes • 1 answer

How do Mapper and Reducer work together "without" sorting?

I know how MapReduce works and what steps it has: mapping, shuffle and sort, reducing. Of course there are also partitioning and combiners, but that's not important right now. The interesting thing is that when I run MapReduce jobs, it looks like mappers and…
grep • 5,465 • 12 • 60 • 112
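
What usually explains the apparent overlap is the shuffle: reducers may be scheduled while maps are still running so they can start copying finished map output early, but the reduce() calls themselves only begin after all maps are done and the merge-sort completes. The launch point is tunable; a sketch using the classic property name (paths and script names are placeholders):

# Reducers launch once 80% of the maps have finished; set the value
# to 1.0 to delay them until every map is complete.
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.slowstart.completed.maps=0.80 \
    -input /data/in -output /data/out \
    -file my_mapper.py -mapper my_mapper.py \
    -file my_reducer.py -reducer my_reducer.py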
3 votes • 1 answer

How to place a file downloaded from a webpage directly into HDFS, without using the local file system?

I need some help. I am downloading a file from a webpage using Python code, placing it in the local file system, transferring it into HDFS using the put command, and then performing operations on it. But there might be some situations where the…
Rahul • 243 • 2 • 6 • 17
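
If the concern is staging through the local disk, the put command can read from standard input: passing "-" as the source lets a download be piped straight into HDFS. A sketch with a hypothetical URL and target path:

# Nothing is written locally; curl streams the HTTP body directly
# into the HDFS file.
curl -L http://example.com/data.csv | hadoop fs -put - /user/rahul/data.csv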
3 votes • 1 answer

How to create a stop condition on Spark streaming?

I want to use Spark Streaming to read data from HDFS. The idea is that another program will keep uploading new files to an HDFS directory, which my Spark Streaming job will process. However, I also want an end condition. That is,…
pythonic • 20,589 • 43 • 136 • 219
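
One pattern for a clean end condition is to poll awaitTerminationOrTimeout in a loop and stop the context when some external signal appears, here a marker file on the local disk (the marker path, poll interval, and HDFS directory are all placeholders):

# stop_when_marker.py - hypothetical sketch
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="hdfs-dir-stream")
ssc = StreamingContext(sc, batchDuration=10)

# Process whatever new files appear in the watched directory.
ssc.textFileStream("hdfs:///incoming").count().pprint()

ssc.start()
# awaitTerminationOrTimeout returns True once the context has stopped.
while not ssc.awaitTerminationOrTimeout(10):
    if os.path.exists("/tmp/stop-streaming"):
        ssc.stop(stopSparkContext=True, stopGraceFully=True)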
3 votes • 2 answers

How to compress a Hadoop directory to a single gzip file?

I have a directory containing lots of files and subdirectories that I want to compress and export from HDFS to the local file system. I came across this question - Hadoop: compress file in HDFS? - but it seems to be relevant only to files, and using…
Elad Leev • 908 • 7 • 17
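
Since the target is a single local file, one hedged approach is to skip MapReduce entirely and let the HDFS shell do the concatenation, piping through gzip on the way out (this assumes the files sit directly under the directory; nested subdirectories would need a deeper glob or an fs -ls -R listing):

# Concatenate every file in the directory and compress the stream
# into one local gzip archive.
hadoop fs -cat /data/mydir/* | gzip > mydir.gz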
3 votes • 0 answers

Data locality not achieved for Hadoop jobs

According to the Hadoop manual, map tasks should be started on a node where the input data is stored in HDFS, if a slot is available. Unfortunately, I found this not to be true when using the Hadoop Streaming library, as tasks were launched on…
mcserep • 3,231 • 21 • 36
3 votes • 0 answers

Hadoop - what does globally sorted mean, and when does it happen in MapReduce?

I am using the Hadoop streaming JAR for WordCount, and I want to know how I can get a globally sorted result. According to an answer to another question on SO, using just one reducer should give globally sorted output, but in my result with numReduceTasks=1…
Saeed Rahmani • 650 • 1 • 8 • 29
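
A single reducer does see every key, but streaming sorts keys as text by default, so word counts come out in lexicographic rather than numeric order; that mismatch is a plausible reading of the numReduceTasks=1 result. The stock comparator can switch to numeric ordering, sketched here with placeholder names:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=1 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /data/in -output /data/out \
    -file wc_map.py -mapper wc_map.py \
    -file wc_reduce.py -reducer wc_reduce.py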
3 votes • 0 answers

Hadoop streaming flat-files to gzip

I've been trying to gzip files (pipe-separated CSV) in Hadoop using hadoop-streaming.jar. I found the following thread on Stack Overflow: Hadoop: compress file in HDFS? and tried both solutions (cat/cut for the mapper). Although I end up…
R. Sluiter • 162 • 1 • 1 • 13
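
The cat-as-mapper trick only compresses the job output if the output codec is set; otherwise the data is copied verbatim. A map-only sketch (placeholder paths) that writes each input split back out as a .gz part file:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input /data/plain \
    -output /data/gzipped \
    -mapper /bin/cat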
3 votes • 1 answer

How to compare two files using Spark?

I want to compare two files and, where they don't match, load the extra records into another file containing the unmatched records. I need to compare every field in both files, and get record counts as well.
Nathon • 165 • 1 • 4 • 13
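
At the RDD level this is a set difference in both directions plus a few counts; a hypothetical sketch (file paths and the output location are placeholders, and lines are compared as whole records):

# compare_files.py - hypothetical sketch
from pyspark import SparkContext

sc = SparkContext(appName="file-diff")
a = sc.textFile("hdfs:///data/file_a.csv")
b = sc.textFile("hdfs:///data/file_b.csv")

only_in_a = a.subtract(b)   # records of A missing from B
only_in_b = b.subtract(a)   # records of B missing from A

print("file A:", a.count(), "file B:", b.count(),
      "unmatched:", only_in_a.count() + only_in_b.count())

# Persist all unmatched records to a separate output directory.
only_in_a.union(only_in_b).saveAsTextFile("hdfs:///data/unmatched")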