Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
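
For illustration, here is a minimal sketch of what the mapper and reducer scripts in the examples above might look like in Python. This is not the exact code those paths refer to: the fixed-width field offsets below are assumptions based on the NCDC weather-record format that the max-temperature example traditionally parses.

#!/usr/bin/env python
# max_temperature_map.py -- illustrative sketch, not the original script.
# Reads raw records from standard input and emits "year<TAB>temperature".
# The fixed-width offsets are an assumption based on the NCDC record format.
import re
import sys

for line in sys.stdin:
    val = line.strip()
    year, temp, quality = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", quality):
        print("%s\t%s" % (year, temp))

#!/usr/bin/env python
# max_temperature_reduce.py -- illustrative sketch.
# Streaming delivers the mapper output to the reducer sorted by key, so the
# reducer tracks a running maximum and emits it whenever the year changes.
import sys

last_year, max_temp = None, None
for line in sys.stdin:
    parts = line.strip().split("\t")
    if len(parts) != 2:
        continue  # skip malformed lines
    year, temp = parts[0], int(parts[1])
    if year != last_year:
        if last_year is not None:
            print("%s\t%s" % (last_year, max_temp))
        last_year, max_temp = year, temp
    else:
        max_temp = max(max_temp, temp)
if last_year is not None:
    print("%s\t%s" % (last_year, max_temp))

Because the contract is only standard input and output, such a pair can be tested without a cluster, for example: cat input/ncdc/sample.txt | ./max_temperature_map.py | sort | ./max_temperature_reduce.py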
871 questions
3 votes, 1 answer

Hadoop streaming with mongo-hadoop connector fails

I created this job that reads a bunch of JSON files from HDFS and tries to load them into MongoDB. It's just the map script because I don't require any additional processing on the reduce step. I'm trying to use the mongo-hadoop connector. The…
3 votes, 1 answer

Deep Learning: is there any open-source library that can be integrated with Hadoop streaming and MapReduce?

A Google search turned up quite a few open-source deep learning frameworks. Here is a collected list Google…
Osiris
3 votes, 0 answers

Hadoop mapreduce defining separators for streaming

I'm using Hadoop 2.7.1. I'm really struggling to understand at what point in the streaming process the sort is applied, how you can change the sort order, and how to change the separator. Reading the documentation has confused me further, since some config variables…
James Owers
3 votes, 3 answers

wordCounts.dstream().saveAsTextFiles("LOCAL FILE SYSTEM PATH", "txt"); does not write to file

I am trying to write a JavaPairRDD to a file on the local system. Code below: JavaPairDStream wordCounts = words.mapToPair( new PairFunction() { @Override public Tuple2 call(String s)…
3 votes, 5 answers

Who will get a chance to execute first, Combiner or Partitioner?

I'm getting confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204): Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.…
3 votes, 3 answers

Kafka broker startup memory issue

I am new to Kafka and Hadoop technologies. I was trying to install and run my first single-node, single-broker cluster on an AWS EC2 VM instance. I am done with: 1) Java installation 2) updating the ~/.bashrc and ~/.bash_profile files with Java-related…
Chauhan B
3 votes, 2 answers

Understanding DataTorrent with example

I am supposed to work on DataTorrent and am looking for articles/documentation to go through. I could not find detailed documentation on what operators are, how they are used to process our data, or on the MALHAR library which is used in…
Atom
3 votes, 1 answer

How to skip failed map tasks in hadoop streaming

I am running a Hadoop streaming MapReduce job which has 26895 map tasks in total. However, one task that deals with a certain input always fails. So I set mapreduce.map.failures.maxpercent=1 and want to skip failed tasks, but the job was still not…
Woaibanzhuan
3 votes, 2 answers

Easy Way to Convert character array to dataframe

I am using R with Hadoop streaming, where at the reducer the value is a character array in which each element is a string that contains a few columns terminated by a certain character, char(2) 002 in this case. Is there an easy way to split the string into…
B.Mr.W.
3 votes, 2 answers

Custom input reader in spark

I am new to Spark and would like to load page records from a Wikipedia dump into an RDD. I tried using a record reader provided in hadoop streaming but can't figure out how to use it. Could anyone help me make the following code create a nice RDD…
zermelozf
3 votes, 0 answers

Python + Cassandra CqlPagingInputFormat + Hadoop Streaming

Intro: I have a Cassandra 1.2.19 cluster with Hadoop 1.2.1 installed and configured in fully distributed mode on top of it (plus an extra non-Cassandra node as a master); everything works fine and I can run map-reduce jobs on it. The problem: Now I…
Sergio Ayestarán
3 votes, 0 answers

hadoop distcp bandwidth issue

I am doing distcp from one Hadoop cluster (version 0.20.2) to another Hadoop cluster (version 2.2.0) using the command below. hadoop distcp -update -skipcrccheck "hftp://x.x.x.x:50070//hive/warehouse//staging_eventlog_arpu_comma" …
user2950086
3 votes, 1 answer

python - PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Recently I have wanted to parse websites and then use BeautifulSoup to filter what I want and write it to a CSV file in HDFS. Now I am at the stage of filtering website code with BeautifulSoup. I want to use the MapReduce method to execute it: hadoop jar…
Danny
3 votes, 2 answers

Hadoop : java.io.IOException: Call to localhost/127.0.0.1:54310 failed on local exception: java.io.EOFException

I am new to Hadoop; I only started with it today. I want to write a file to the HDFS Hadoop server. I am using Hadoop 1.2.1. When I run the jps command in the CLI, I can see that all the nodes are running: 31895 Jps 29419 SecondaryNameNode 29745…
Harry
3 votes, 2 answers

Hadoop streaming accessing files in a directory

I wish to access a directory in Hadoop (via Python streaming) and loop through its image files, calculating the hash of each in my mapper. Does the following logic make sense (and instead of hard coding, can I pass the directory to Hadoop as e.g.…
schoon