Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the executable only needs to be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
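
In practice, the mapper and reducer are just programs that read lines from standard input and write tab-separated key/value pairs to standard output. A minimal word-count sketch in Python (the file names mapper.py and reducer.py are illustrative):

#!/usr/bin/env python
# mapper.py - emit each word with a count of 1, as "word<TAB>1"
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - sum the counts for each word; Hadoop delivers the input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The pair would be submitted like the examples above, e.g. with -mapper mapper.py -reducer reducer.py, shipping the scripts to the cluster with -file mapper.py -file reducer.py.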
871 questions
6 votes, 1 answer

Hive FAILED: ParseException line 2:0 cannot recognize input near ''macaddress'' 'CHAR' '(' in column specification

I've tried running hive -v -f sqlfile.sql. Here is the content of the file: CREATE TABLE UpStreamParam ( 'macaddress' CHAR(50), 'datats' BIGINT, 'cmtstimestamp' BIGINT, 'modulation' INT, 'chnlidx' INT, 'severity' BIGINT, 'rxpower' …
Alex Brodov
6 votes, 2 answers

How to decide when to use a Map-Side Join or a Reduce-Side Join while writing MR code in Java?

How to decide when to use a Map-Side Join or a Reduce-Side Join while writing MR code in Java?
jkalyanc
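
As a rough rule, a map-side join fits when one input is small enough to be loaded into every mapper's memory (or when both inputs are pre-partitioned and sorted the same way); otherwise a reduce-side join is the usual choice. A streaming sketch of the in-memory variant, assuming a tab-separated lookup file named small_table.txt shipped to the tasks with -file:

#!/usr/bin/env python
# mapside_join_mapper.py - map-side join: the small table is held in memory, so no
# reduce phase is needed. Assumes "small_table.txt" (an assumed name) was shipped
# with -file and contains "key<TAB>value" lines.
import sys

lookup = {}
with open("small_table.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2:
            lookup[parts[0]] = parts[1]

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) == 2 and parts[0] in lookup:
        print("%s\t%s\t%s" % (parts[0], parts[1], lookup[parts[0]]))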
6 votes, 4 answers

Python MapReduce Hadoop Streaming Job that requires multiple input files?

I have two files in my cluster File A and File B with the following data - File A #Format: #Food Item | Is_A_Fruit (BOOL) Orange | Yes Pineapple | Yes Cucumber | No Carrot | No Mango | Yes File B #Format: #Food Item | Vendor Name Orange |…
ComputerFellow
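
For two inputs like File A and File B above, the usual streaming pattern is a reduce-side join: the mapper tags each record with its source, and the reducer merges records that share a key. A sketch under the assumption that the source can be recognized from the input file name ("FileA" below is an assumed name; the environment variable exposing the file name differs by Hadoop version, so both are tried):

#!/usr/bin/env python
# join_mapper.py - tag every record with its source file so the reducer can join them
import os
import sys

input_file = (os.environ.get("mapreduce_map_input_file")
              or os.environ.get("map_input_file", ""))
tag = "A" if "FileA" in input_file else "B"

for line in sys.stdin:
    if line.startswith("#"):              # skip the "#Format:" header lines
        continue
    parts = [p.strip() for p in line.split("|", 1)]
    if len(parts) == 2:
        item, value = parts
        print("%s\t%s|%s" % (item, tag, value))

#!/usr/bin/env python
# join_reducer.py - merge the fruit flag (File A) with the vendors (File B) per item
import sys

def flush(item, flag, vendors):
    # items that never appear in File B produce no output in this sketch
    for vendor in vendors:
        print("%s\t%s\t%s" % (item, flag or "?", vendor))

current, flag, vendors = None, None, []
for line in sys.stdin:
    item, tagged = line.rstrip("\n").split("\t", 1)
    tag, value = tagged.split("|", 1)
    if item != current and current is not None:
        flush(current, flag, vendors)
        flag, vendors = None, []
    current = item
    if tag == "A":
        flag = value
    else:
        vendors.append(value)
if current is not None:
    flush(current, flag, vendors)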
6 votes, 2 answers

Map Reduce output to CSV or do I need Key Values?

My map function produces a Key\tValue Value = List(value1, value2, value3) then my reduce function produces: Key\tCSV-Line Ex. 2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s, 2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s Ex.…
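
Since a streaming reducer's output is simply whatever it prints, the result can be plain CSV; keeping a key in each line is optional. A minimal sketch, assuming the map output arrives as key<TAB>value with comma-friendly values:

#!/usr/bin/env python
# csv_reducer.py - collect the values seen for each key and emit one CSV line per key
import sys

current, fields = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current and current is not None:
        print(",".join([current] + fields))   # drop "[current] +" for a key-less CSV line
        fields = []
    current = key
    fields.append(value)
if current is not None:
    print(",".join([current] + fields))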
6 votes, 3 answers

Running the Python Code on Hadoop Failed

I have tried to follow the instructions on this page: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ $bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -input /user/root/wordcountpythontxt -output…
USB
6 votes, 1 answer

Hadoop Configuration Error

I am attempting to start up my hadoop application; however, upon startup I am seeing this in the log files. Does anyone have a clue as to what the problem is? Creating filesystem for hdfs://10.170.4.141:9000 java.io.IOException: config() …
godzilla
6 votes, 8 answers

POC for Hadoop in real time scenario

I have a bit of a problem. I want to learn about Hadoop and how I might use it to handle data streams in real time. As such I want to build a meaningful POC around it so that I can showcase it when I have to prove my knowledge of it in front of some…
Kumar Vaibhav
6 votes, 1 answer

EMR How to join files into one?

I've split a big binary file into 2 GB chunks and uploaded them to Amazon S3. Now I want to join them back into one file and process it with my custom … I've tried to run elastic-mapreduce -j $JOBID -ssh \ "hadoop dfs -cat s3n://bucket/dir/in/* >…
denys
6 votes, 2 answers

Amazon Elastic MapReduce - SIGTERM

I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 inputs, at about 20 seconds per input), after 2.5 hours…
slavi
5 votes, 3 answers

Sorting by value in Hadoop from a file

I have a file containing a string, then a space, and then a number on every line. Example: Line1: Word 2 Line2: Word1 8 Line3: Word2 1 I need to sort the numbers in descending order and then put the result in a file, assigning a rank to the…
Deepika Sethi
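
One simple streaming take on this, assuming the data fits in a single reducer's memory, is an identity mapper plus a reducer that sorts by the numeric field and assigns ranks (the output layout below is illustrative):

#!/usr/bin/env python
# rank_reducer.py - sort "word count" lines by count, descending, and assign ranks.
# Single-reducer approach: assumes the whole data set fits in the reducer's memory.
import sys

rows = []
for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        rows.append((parts[0], int(parts[1])))

for rank, (word, count) in enumerate(sorted(rows, key=lambda r: r[1], reverse=True), 1):
    print("%d\t%s\t%d" % (rank, word, count))

This could be run with something like -mapper cat -reducer rank_reducer.py -numReduceTasks 1 so that a single reducer sees all the lines.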
5 votes, 1 answer

Python Streaming: how to reduce to multiple outputs? (it's possible with Java though)

I read Hadoop in Action and found that in Java using MultipleOutputFormat and MultipleOutputs classes we can reduce the data to multiple files but what I am not sure is how to achieve the same thing using Python streaming. for example: …
daydreamer
5 votes, 1 answer

Hadoop streaming job using MXNet failing in AWS EMR

I have set up an EMR step in AWS Data Pipeline. The step command looks like this:…
ishan3243
5 votes, 2 answers

How to get s3distcp to merge with newlines

I have many millions of small one line s3 files that I'm looking to merge together. I have the s3distcp syntax down, however, I've discovered that after merging the files no newlines are contained in the merged set. I was wondering if s3distcp…
5 votes, 1 answer

How to package python script with dependencies into zip/tar?

I've got a hadoop cluster that I'm doing data analytics on using Numpy, SciPy, and Pandas. I'd like to be able to submit my hadoop jobs as a zip/tar file using the '--file ' argument to a command. That zip file should have EVERYTHING that my python…
jbarney
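
For pure-Python dependencies, one common approach is to ship a zip of the packages alongside the job and put it on sys.path inside the script; this does not help with C-extension packages such as NumPy, SciPy, or Pandas, which normally have to be installed on the worker nodes. A sketch, with deps.zip and mymodule as assumed names:

#!/usr/bin/env python
# mapper.py - make pure-Python dependencies shipped with the job importable.
# Assumes an archive named deps.zip was shipped via "-file deps.zip"; packages
# with C extensions cannot be imported from a zip this way.
import sys
sys.path.insert(0, "deps.zip")   # zipimport resolves pure-Python modules inside the archive

import mymodule                  # hypothetical module packaged inside deps.zip

for line in sys.stdin:
    print(mymodule.transform(line.rstrip("\n")))   # hypothetical function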
5 votes, 1 answer

os.environ['mapreduce_map_input_file'] doesn't work

I created a simple map reduce in Python, just to test the os.environ['mapreduce_map_input_file'] call, as you can see below: map.py #!/usr/bin/python import sys # input comes from STDIN (stream data that goes to the program) for line in…
bmpasini
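
A frequent cause of this is simply the variable name: streaming exports job configuration properties to the environment with dots replaced by underscores, and the property was renamed between Hadoop versions (map.input.file vs mapreduce.map.input.file). A sketch that tries both names:

#!/usr/bin/env python
# map.py - read the name of the file backing the current input split.
# The property name differs between Hadoop 1.x and 2.x, so both variants are tried.
import os
import sys

input_file = (os.environ.get("mapreduce_map_input_file")
              or os.environ.get("map_input_file")
              or "unknown")

for line in sys.stdin:
    print("%s\t%s" % (input_file, line.rstrip("\n")))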