Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the executable or script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read from standard input and write to standard output.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
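
The mapper and reducer are ordinary executables. As a rough illustration (a word-count pair with hypothetical file names, not the max_temperature scripts above), a streaming mapper turns input lines into tab-separated key/value pairs, and the reducer receives those pairs sorted by key:

wc_mapper.py:

#!/usr/bin/env python3
# Streaming mapper: emit "word<TAB>1" for every word on standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

wc_reducer.py:

#!/usr/bin/env python3
# Streaming reducer: the framework sorts map output by key, so all
# counts for a word arrive adjacently and can be summed in one pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

Because the scripts only touch stdin and stdout, the same pipeline can be tested without Hadoop at all, e.g. cat input.txt | ./wc_mapper.py | sort | ./wc_reducer.py.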
871 questions
4 votes • 4 answers

Unzip files using hadoop streaming

I have many files in HDFS, each of them a zip file with one CSV file inside. I'm trying to uncompress the files so I can run a streaming job on them. I tried: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \ -D mapred.reduce.tasks=0…
Miki Tebeka • 13,428 • 4 • 37 • 49
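
For the unzip question above: ZIP is not one of Hadoop's native compression codecs, so streaming jobs cannot decompress these files transparently. One common workaround, sketched below under the assumption that Info-ZIP's funzip is available on the client and that each archive holds a single CSV (paths are placeholders), is to unzip outside of MapReduce by piping through the HDFS shell:

# Stream the archive out of HDFS, extract its first (and only)
# member with funzip, and write the CSV back without touching
# the local disk.
hadoop fs -cat /data/zipped/file1.zip | funzip | \
    hadoop fs -put - /data/unzipped/file1.csv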
4 votes • 2 answers

Hadoop streaming: how do I set the partitioning?

I'm very new to Hadoop streaming and am having some difficulty with partitioning. Depending on what is found in a line, my mapper function either returns key1, 0, somegeneralvalues # some kind of "header" line where linetype = 0 or key1, 1, value1,…
aherve • 3,795 • 6 • 28 • 41
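
For composite keys like key1 plus a linetype flag, the stock KeyFieldBasedPartitioner can sort on both fields while partitioning on the first one only, so the "header" line (linetype 0) reaches the same reducer as its data lines and sorts ahead of them. A sketch with placeholder script and directory names:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mapper.py \
    -reducer reducer.py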
4 votes • 6 answers

Python code is valid but Hadoop Streaming produces part-00000 "Empty file"

On an Ubuntu virtual machine I have set up a single-node cluster as per Michael Noll's tutorial and this has been my starting point for writing a Hadoop program. Also, for reference, this. My program is in Python and uses Hadoop Streaming. I have…
dafuloth • 43 • 1 • 6
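
An empty part-00000 with valid Python usually points at the plumbing rather than the logic: the scripts were not shipped to the task nodes, lack a shebang line, or are not executable. A checklist-style sketch (file names are placeholders):

# Each script needs a shebang (#!/usr/bin/env python) and the
# executable bit, and -file ships it into every task's working dir.
chmod +x mapper.py reducer.py
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -file mapper.py -mapper mapper.py \
    -file reducer.py -reducer reducer.py \
    -input /user/me/input \
    -output /user/me/output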
3 votes • 2 answers

Can we cascade multiple MapReduce jobs in Hadoop Streaming (lang: Python)?

I am using Python and have to work on the following scenario using Hadoop Streaming: a) Map1->Reduce1->Map2->Reduce2 b) I don't want to store intermediate files c) I don't want to install packages like Cascading, Yelp, Oozie. I have kept them as a last…
Piyush Kansal • 1,201 • 4 • 18 • 26
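
Hadoop streaming has no built-in job chaining, so Map1->Reduce1->Map2->Reduce2 is normally driven from a wrapper script; the intermediate output does have to land in HDFS, but it can be deleted as soon as the second job succeeds. A minimal sketch with placeholder paths and script names:

# Job 1 writes to a scratch directory, job 2 consumes it, and the
# scratch directory is removed once the chain has finished.
INTERMEDIATE=/tmp/job1-out
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input /data/in -output $INTERMEDIATE \
    -file map1.py -mapper map1.py -file reduce1.py -reducer reduce1.py
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input $INTERMEDIATE -output /data/out \
    -file map2.py -mapper map2.py -file reduce2.py -reducer reduce2.py
hadoop fs -rm -r $INTERMEDIATE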
3 votes • 2 answers

Does Hadoop really handle datanode failure?

In our Hadoop setup, when a datanode crashes (or Hadoop doesn't respond on the datanode), the reduce task fails, unable to read from the failed node (exception below). I thought Hadoop handled datanode failures, and that that is the main purpose of creating…
Boolean • 14,266 • 30 • 88 • 129
3 votes • 1 answer

Hadoop cluster - Do I need to replicate my code over all machines before running a job?

This is what confuses me: when I use the wordcount example, I keep the code on the master and let it handle the slaves, and it runs fine. But when I run my own code, it starts to fail on the slaves, giving weird errors like Traceback (most recent call…
daydreamer • 87,243 • 191 • 450 • 722
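
The usual reason the wordcount example "just works" is that its Java classes travel inside the job JAR; a streaming script enjoys no such packaging unless it is shipped explicitly. The -file option (the -files generic option in newer releases) uploads the script to every task's working directory, so nothing needs to be copied to the slaves by hand. A sketch with placeholder names:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input /data/in -output /data/out \
    -file my_mapper.py -mapper my_mapper.py \
    -file my_reducer.py -reducer my_reducer.py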
3 votes • 3 answers

Hadoop streaming: ensuring one key per reducer

I have a mapper that, while processing data, classifies output into 3 different types (type is the output key). My goal is to create 3 different csv files via the reducers, each with all of the data for one key with a header row. The key values can…
underrun • 6,713 • 2 • 41 • 53
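
With exactly three output types, a natural first step is three reducers partitioned on the key, sketched below with placeholder names. One caveat: hash partitioning gives no guarantee that three distinct keys land in three distinct partitions, so a mapper that emits an explicit partition number (0, 1, 2) as the first key field is often the more reliable route.

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=3 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /data/in -output /data/out \
    -file classify.py -mapper classify.py \
    -file to_csv.py -reducer to_csv.py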
3 votes • 1 answer

How do Mapper and Reducer work together "without" sorting?

I know how MapReduce works and what steps it has: mapping, shuffle and sort, reducing. Of course there are also partitioning and combiners, but that's not important right now. The interesting thing is that when I run MapReduce jobs, it looks like mappers and…
grep • 5,465 • 12 • 60 • 112
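
What usually explains the apparent overlap is the shuffle: reducers may be scheduled while maps are still running so they can start copying finished map output early, but the reduce() calls themselves only begin after all maps are done and the merge-sort completes. The launch point is tunable; a sketch using the classic property name (paths and script names are placeholders):

# Reducers launch once 80% of the maps have finished; set the value
# to 1.0 to delay them until every map is complete.
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.slowstart.completed.maps=0.80 \
    -input /data/in -output /data/out \
    -file my_mapper.py -mapper my_mapper.py \
    -file my_reducer.py -reducer my_reducer.py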
3 votes • 1 answer

How to place a file downloaded from a webpage directly into HDFS, without using the local file system?

I need some help. I am downloading a file from a webpage using Python code, placing it in the local file system, transferring it into HDFS using the put command, and then performing operations on it. But there might be some situations where the…
Rahul • 243 • 2 • 6 • 17
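
If the concern is staging through the local disk, the put command can read from standard input: passing "-" as the source lets a download be piped straight into HDFS. A sketch with a hypothetical URL and target path:

# Nothing is written locally; curl streams the HTTP body directly
# into the HDFS file.
curl -L http://example.com/data.csv | hadoop fs -put - /user/rahul/data.csv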
3 votes • 1 answer

How to create a stop condition on Spark streaming?

I want to use Spark Streaming to read data from HDFS. The idea is that another program will keep uploading new files to an HDFS directory, which my Spark Streaming job will process. However, I also want an end condition. That is,…
pythonic • 20,589 • 43 • 136 • 219
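
One pattern for a clean end condition is to poll awaitTerminationOrTimeout in a loop and stop the context when some external signal appears, here a marker file on the local disk (the marker path, poll interval, and HDFS directory are all placeholders):

# stop_when_marker.py - hypothetical sketch
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="hdfs-dir-stream")
ssc = StreamingContext(sc, batchDuration=10)

# Process whatever new files appear in the watched directory.
ssc.textFileStream("hdfs:///incoming").count().pprint()

ssc.start()
# awaitTerminationOrTimeout returns True once the context has stopped.
while not ssc.awaitTerminationOrTimeout(10):
    if os.path.exists("/tmp/stop-streaming"):
        ssc.stop(stopSparkContext=True, stopGraceFully=True)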
3 votes • 2 answers

How to compress a Hadoop directory to a single gzip file?

I have a directory containing lots of files and subdirectories that I want to compress and export from HDFS to the local file system. I came across this question - Hadoop: compress file in HDFS? - but it seems to be relevant only to files, and using…
Elad Leev • 908 • 7 • 17
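
Since the target is a single local file, one hedged approach is to skip MapReduce entirely and let the HDFS shell do the concatenation, piping through gzip on the way out (this assumes the files sit directly under the directory; nested subdirectories would need a deeper glob or an fs -ls -R listing):

# Concatenate every file in the directory and compress the stream
# into one local gzip archive.
hadoop fs -cat /data/mydir/* | gzip > mydir.gz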
3 votes • 0 answers

Data locality not achieved for Hadoop jobs

According to the Hadoop manual, map tasks should be started on a node where the input data is stored in HDFS, if a slot is available. Unfortunately, I found this not to be true when using the Hadoop Streaming library, as tasks were launched on…
mcserep • 3,231 • 21 • 36
3 votes • 0 answers

Hadoop - what does globally sorted mean, and when does it happen in MapReduce?

I am using the Hadoop streaming JAR for WordCount, and I want to know how I can get a globally sorted result. According to an answer to another question on SO, using just one reducer should give globally sorted output, but in my result with numReduceTasks=1…
Saeed Rahmani • 650 • 1 • 8 • 29
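
A single reducer does see every key, but streaming sorts keys as text by default, so word counts come out in lexicographic rather than numeric order; that mismatch is a plausible reading of the numReduceTasks=1 result. The stock comparator can switch to numeric ordering, sketched here with placeholder names:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=1 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /data/in -output /data/out \
    -file wc_map.py -mapper wc_map.py \
    -file wc_reduce.py -reducer wc_reduce.py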
3 votes • 0 answers

Hadoop streaming flat-files to gzip

I've been trying to gzip files (pipe-separated CSV) in Hadoop using hadoop-streaming.jar. I found the following thread on Stack Overflow: Hadoop: compress file in HDFS? and tried both solutions (cat/cut for the mapper). Although I end up…
R. Sluiter • 162 • 1 • 1 • 13
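
The cat-as-mapper trick only compresses the job output if the output codec is set; otherwise the data is copied verbatim. A map-only sketch (placeholder paths) that writes each input split back out as a .gz part file:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input /data/plain \
    -output /data/gzipped \
    -mapper /bin/cat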
3 votes • 1 answer

How to compare two files using Spark?

I want to compare two files and, where they don't match, load the extra records into another file containing the unmatched records. I need to compare every field in both files, and get record counts as well.
Nathon • 165 • 1 • 4 • 13
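
At the RDD level this is a set difference in both directions plus a few counts; a hypothetical sketch (file paths and the output location are placeholders, and lines are compared as whole records):

# compare_files.py - hypothetical sketch
from pyspark import SparkContext

sc = SparkContext(appName="file-diff")
a = sc.textFile("hdfs:///data/file_a.csv")
b = sc.textFile("hdfs:///data/file_b.csv")

only_in_a = a.subtract(b)   # records of A missing from B
only_in_b = b.subtract(a)   # records of B missing from A

print("file A:", a.count(), "file B:", b.count(),
      "unmatched:", only_in_a.count() + only_in_b.count())

# Persist all unmatched records to a separate output directory.
only_in_a.union(only_in_b).saveAsTextFile("hdfs:///data/unmatched")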