Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
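
For illustration, here is a minimal sketch of what the mapper and reducer scripts in the examples above might look like in Python. This is not the exact code those paths refer to: the fixed-width field offsets below are assumptions based on the NCDC weather-record format that the max-temperature example traditionally parses.

#!/usr/bin/env python
# max_temperature_map.py -- illustrative sketch, not the original script.
# Reads raw records from standard input and emits "year<TAB>temperature".
# The fixed-width offsets are an assumption based on the NCDC record format.
import re
import sys

for line in sys.stdin:
    val = line.strip()
    year, temp, quality = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", quality):
        print("%s\t%s" % (year, temp))

#!/usr/bin/env python
# max_temperature_reduce.py -- illustrative sketch.
# Streaming delivers the mapper output to the reducer sorted by key, so the
# reducer tracks a running maximum and emits it whenever the year changes.
import sys

last_year, max_temp = None, None
for line in sys.stdin:
    parts = line.strip().split("\t")
    if len(parts) != 2:
        continue  # skip malformed lines
    year, temp = parts[0], int(parts[1])
    if year != last_year:
        if last_year is not None:
            print("%s\t%s" % (last_year, max_temp))
        last_year, max_temp = year, temp
    else:
        max_temp = max(max_temp, temp)
if last_year is not None:
    print("%s\t%s" % (last_year, max_temp))

Because the contract is only standard input and output, such a pair can be tested without a cluster, for example: cat input/ncdc/sample.txt | ./max_temperature_map.py | sort | ./max_temperature_reduce.py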
871 questions
3 votes, 1 answer

Hadoop streaming with mongo-hadoop connector fails

I created this job that reads a bunch of JSON files from HDFS and tries to load them into MongoDB. It's just the map script because I don't require any additional processing on the reduce step. I'm trying to use the mongo-hadoop connector. The…
3 votes, 1 answer

Deep Learning: is there any open-source library that can be integrated with Hadoop streaming and MapReduce?

A Google search turned up quite a few open-source deep learning frameworks. Here is a collected list Google…
Osiris
3 votes, 0 answers

Hadoop mapreduce defining separators for streaming

I'm using Hadoop 2.7.1. I'm really struggling to understand at what point in the streaming process the sort is applied, how you can change the sort order, and how to change the separator. Reading the documentation has confused me further, since some config variables…
James Owers
3 votes, 3 answers

wordCounts.dstream().saveAsTextFiles("LOCAL FILE SYSTEM PATH", "txt"); does not write to file

I am trying to write a JavaPairRDD to a file on the local system. Code below: JavaPairDStream wordCounts = words.mapToPair( new PairFunction() { @Override public Tuple2 call(String s)…
3 votes, 5 answers

Who will get a chance to execute first, Combiner or Partitioner?

I'm getting confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204): Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.…
3 votes, 3 answers

Kafka broker startup memory issue

I am new to Kafka and Hadoop technologies. I was trying to install and run my first single-node, single-broker cluster on an AWS EC2 VM instance. I am done with: 1) Java installation 2) updating the ~/.bashrc and ~/.bash_profile files with Java-related…
Chauhan B
3 votes, 2 answers

Understanding DataTorrent with example

I am supposed to work on DataTorrent and am looking for articles/documentation to go through. I could not find detailed documentation on what operators are, how they are used to process our data, or on the MALHAR library which is used in…
Atom
3 votes, 1 answer

How to skip failed map tasks in hadoop streaming

I am running a Hadoop streaming MapReduce job which has 26895 map tasks in total. However, one task that deals with a certain input always fails. So I set mapreduce.map.failures.maxpercent=1 and want to skip failed tasks, but the job was still not…
Woaibanzhuan
3 votes, 2 answers

Easy Way to Convert character array to dataframe

I am using R with Hadoop streaming, where at the reducer the value is a character array in which each element is a string that contains a few columns terminated by a certain character, char(2) 002 in this case. Is there an easy way to split the string into…
B.Mr.W.
3 votes, 2 answers

Custom input reader in spark

I am new to Spark and would like to load page records from a Wikipedia dump into an RDD. I tried using a record reader provided in hadoop streaming but can't figure out how to use it. Could anyone help me make the following code create a nice RDD…
zermelozf
3 votes, 0 answers

Python + Cassandra CqlPagingInputFormat + Hadoop Streaming

Intro: I have a Cassandra 1.2.19 cluster with Hadoop 1.2.1 installed and configured in fully distributed mode on top of it (plus an extra non-Cassandra node as a master); everything works fine and I can run map-reduce jobs on it. The problem: Now I…
Sergio Ayestarán
3 votes, 0 answers

hadoop distcp bandwidth issue

I am doing distcp from one Hadoop cluster (version 0.20.2) to another Hadoop cluster (version 2.2.0) using the command below. hadoop distcp -update -skipcrccheck "hftp://x.x.x.x:50070//hive/warehouse//staging_eventlog_arpu_comma" …
user2950086
3 votes, 1 answer

python - PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Recently I have wanted to parse websites and then use BeautifulSoup to filter what I want and write it to a CSV file in HDFS. Now I am at the stage of filtering website code with BeautifulSoup. I want to use the MapReduce method to execute it: hadoop jar…
Danny
3 votes, 2 answers

Hadoop : java.io.IOException: Call to localhost/127.0.0.1:54310 failed on local exception: java.io.EOFException

I am new to Hadoop; I only started with it today. I want to write a file to the HDFS Hadoop server. I am using Hadoop 1.2.1. When I run the jps command in the CLI, I can see that all the nodes are running: 31895 Jps 29419 SecondaryNameNode 29745…
Harry
3 votes, 2 answers

Hadoop streaming accessing files in a directory

I wish to access a directory in Hadoop (via Python streaming) and loop through its image files, calculating the hash of each in my mapper. Does the following logic make sense (and instead of hard coding, can I pass the directory to Hadoop as e.g.…
schoon