Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the executable only needs to be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
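
In practice, the mapper and reducer are just programs that read lines from standard input and write tab-separated key/value pairs to standard output. A minimal word-count sketch in Python (the file names mapper.py and reducer.py are illustrative):

#!/usr/bin/env python
# mapper.py - emit each word with a count of 1, as "word<TAB>1"
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - sum the counts for each word; Hadoop delivers the input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The pair would be submitted like the examples above, e.g. with -mapper mapper.py -reducer reducer.py, shipping the scripts to the cluster with -file mapper.py -file reducer.py.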
871 questions
6 votes, 1 answer

Hive FAILED: ParseException line 2:0 cannot recognize input near ''macaddress'' 'CHAR' '(' in column specification

I've tried running hive -v -f sqlfile.sql. Here is the content of the file: CREATE TABLE UpStreamParam ( 'macaddress' CHAR(50), 'datats' BIGINT, 'cmtstimestamp' BIGINT, 'modulation' INT, 'chnlidx' INT, 'severity' BIGINT, 'rxpower' …
Alex Brodov
6 votes, 2 answers

How to decide when to use a Map-Side Join or a Reduce-Side Join while writing MR code in Java?

How to decide when to use a Map-Side Join or a Reduce-Side Join while writing MR code in Java?
jkalyanc
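
As a rough rule, a map-side join fits when one input is small enough to be loaded into every mapper's memory (or when both inputs are pre-partitioned and sorted the same way); otherwise a reduce-side join is the usual choice. A streaming sketch of the in-memory variant, assuming a tab-separated lookup file named small_table.txt shipped to the tasks with -file:

#!/usr/bin/env python
# mapside_join_mapper.py - map-side join: the small table is held in memory, so no
# reduce phase is needed. Assumes "small_table.txt" (an assumed name) was shipped
# with -file and contains "key<TAB>value" lines.
import sys

lookup = {}
with open("small_table.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2:
            lookup[parts[0]] = parts[1]

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) == 2 and parts[0] in lookup:
        print("%s\t%s\t%s" % (parts[0], parts[1], lookup[parts[0]]))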
6 votes, 4 answers

Python MapReduce Hadoop Streaming Job that requires multiple input files?

I have two files in my cluster File A and File B with the following data - File A #Format: #Food Item | Is_A_Fruit (BOOL) Orange | Yes Pineapple | Yes Cucumber | No Carrot | No Mango | Yes File B #Format: #Food Item | Vendor Name Orange |…
ComputerFellow
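
For two inputs like File A and File B above, the usual streaming pattern is a reduce-side join: the mapper tags each record with its source, and the reducer merges records that share a key. A sketch under the assumption that the source can be recognized from the input file name ("FileA" below is an assumed name; the environment variable exposing the file name differs by Hadoop version, so both are tried):

#!/usr/bin/env python
# join_mapper.py - tag every record with its source file so the reducer can join them
import os
import sys

input_file = (os.environ.get("mapreduce_map_input_file")
              or os.environ.get("map_input_file", ""))
tag = "A" if "FileA" in input_file else "B"

for line in sys.stdin:
    if line.startswith("#"):              # skip the "#Format:" header lines
        continue
    parts = [p.strip() for p in line.split("|", 1)]
    if len(parts) == 2:
        item, value = parts
        print("%s\t%s|%s" % (item, tag, value))

#!/usr/bin/env python
# join_reducer.py - merge the fruit flag (File A) with the vendors (File B) per item
import sys

def flush(item, flag, vendors):
    # items that never appear in File B produce no output in this sketch
    for vendor in vendors:
        print("%s\t%s\t%s" % (item, flag or "?", vendor))

current, flag, vendors = None, None, []
for line in sys.stdin:
    item, tagged = line.rstrip("\n").split("\t", 1)
    tag, value = tagged.split("|", 1)
    if item != current and current is not None:
        flush(current, flag, vendors)
        flag, vendors = None, []
    current = item
    if tag == "A":
        flag = value
    else:
        vendors.append(value)
if current is not None:
    flush(current, flag, vendors)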
6 votes, 2 answers

Map Reduce output to CSV or do I need Key Values?

My map function produces a Key\tValue Value = List(value1, value2, value3) then my reduce function produces: Key\tCSV-Line Ex. 2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s, 2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s Ex.…
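
Since a streaming reducer's output is simply whatever it prints, the result can be plain CSV; keeping a key in each line is optional. A minimal sketch, assuming the map output arrives as key<TAB>value with comma-friendly values:

#!/usr/bin/env python
# csv_reducer.py - collect the values seen for each key and emit one CSV line per key
import sys

current, fields = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current and current is not None:
        print(",".join([current] + fields))   # drop "[current] +" for a key-less CSV line
        fields = []
    current = key
    fields.append(value)
if current is not None:
    print(",".join([current] + fields))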
6 votes, 3 answers

Running the Python Code on Hadoop Failed

I have tried to follow the instructions on this page: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ $bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -input /user/root/wordcountpythontxt -output…
USB
6 votes, 1 answer

Hadoop Configuration Error

I am attempting to start up my hadoop application; however, upon startup I am seeing this in the log files. Does anyone have a clue as to what the problem is? Creating filesystem for hdfs://10.170.4.141:9000 java.io.IOException: config() …
godzilla
6 votes, 8 answers

POC for Hadoop in real time scenario

I have a bit of a problem. I want to learn about Hadoop and how I might use it to handle data streams in real time. As such I want to build a meaningful POC around it so that I can showcase it when I have to prove my knowledge of it in front of some…
Kumar Vaibhav
6 votes, 1 answer

EMR How to join files into one?

I've split a big binary file into 2 GB chunks and uploaded them to Amazon S3. Now I want to join them back into one file and process it with my custom … I've tried to run elastic-mapreduce -j $JOBID -ssh \ "hadoop dfs -cat s3n://bucket/dir/in/* >…
denys
6 votes, 2 answers

Amazon Elastic MapReduce - SIGTERM

I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 inputs, at about 20 seconds per input), after 2.5 hours…
slavi
5 votes, 3 answers

Sorting by value in Hadoop from a file

I have a file containing a string, then a space, and then a number on every line. Example: Line1: Word 2 Line2: Word1 8 Line3: Word2 1 I need to sort the numbers in descending order and then put the result in a file, assigning a rank to the…
Deepika Sethi
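
One simple streaming take on this, assuming the data fits in a single reducer's memory, is an identity mapper plus a reducer that sorts by the numeric field and assigns ranks (the output layout below is illustrative):

#!/usr/bin/env python
# rank_reducer.py - sort "word count" lines by count, descending, and assign ranks.
# Single-reducer approach: assumes the whole data set fits in the reducer's memory.
import sys

rows = []
for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        rows.append((parts[0], int(parts[1])))

for rank, (word, count) in enumerate(sorted(rows, key=lambda r: r[1], reverse=True), 1):
    print("%d\t%s\t%d" % (rank, word, count))

This could be run with something like -mapper cat -reducer rank_reducer.py -numReduceTasks 1 so that a single reducer sees all the lines.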
5 votes, 1 answer

Python Streaming: how to reduce to multiple outputs? (it's possible with Java though)

I read Hadoop in Action and found that in Java using MultipleOutputFormat and MultipleOutputs classes we can reduce the data to multiple files but what I am not sure is how to achieve the same thing using Python streaming. for example: …
daydreamer
5 votes, 1 answer

Hadoop streaming job using MXNet failing in AWS EMR

I have set up an EMR step in AWS Data Pipeline. The step command looks like this:…
ishan3243
5 votes, 2 answers

How to get s3distcp to merge with newlines

I have many millions of small one line s3 files that I'm looking to merge together. I have the s3distcp syntax down, however, I've discovered that after merging the files no newlines are contained in the merged set. I was wondering if s3distcp…
5 votes, 1 answer

How to package python script with dependencies into zip/tar?

I've got a hadoop cluster that I'm doing data analytics on using Numpy, SciPy, and Pandas. I'd like to be able to submit my hadoop jobs as a zip/tar file using the '--file ' argument to a command. That zip file should have EVERYTHING that my python…
jbarney
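
For pure-Python dependencies, one common approach is to ship a zip of the packages alongside the job and put it on sys.path inside the script; this does not help with C-extension packages such as NumPy, SciPy, or Pandas, which normally have to be installed on the worker nodes. A sketch, with deps.zip and mymodule as assumed names:

#!/usr/bin/env python
# mapper.py - make pure-Python dependencies shipped with the job importable.
# Assumes an archive named deps.zip was shipped via "-file deps.zip"; packages
# with C extensions cannot be imported from a zip this way.
import sys
sys.path.insert(0, "deps.zip")   # zipimport resolves pure-Python modules inside the archive

import mymodule                  # hypothetical module packaged inside deps.zip

for line in sys.stdin:
    print(mymodule.transform(line.rstrip("\n")))   # hypothetical function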
5 votes, 1 answer

os.environ['mapreduce_map_input_file'] doesn't work

I created a simple map reduce in Python, just to test the os.environ['mapreduce_map_input_file'] call, as you can see below: map.py #!/usr/bin/python import sys # input comes from STDIN (stream data that goes to the program) for line in…
bmpasini
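
A frequent cause of this is simply the variable name: streaming exports job configuration properties to the environment with dots replaced by underscores, and the property was renamed between Hadoop versions (map.input.file vs mapreduce.map.input.file). A sketch that tries both names:

#!/usr/bin/env python
# map.py - read the name of the file backing the current input split.
# The property name differs between Hadoop 1.x and 2.x, so both variants are tried.
import os
import sys

input_file = (os.environ.get("mapreduce_map_input_file")
              or os.environ.get("map_input_file")
              or "unknown")

for line in sys.stdin:
    print("%s\t%s" % (input_file, line.rstrip("\n")))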