Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. It lets you create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script must be able to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
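The mapper and reducer can be any executables that honor this stdin/stdout contract. As a sketch (a hypothetical script, not part of the Hadoop distribution), here is a word-count mapper/reducer pair in Python; streaming feeds input lines to the mapper, sorts the emitted `key<TAB>value` lines by key, and then feeds them to the reducer:

```python
#!/usr/bin/env python3
# Hypothetical word-count scripts for Hadoop Streaming (not shipped with Hadoop).
# In a real job the map and reduce halves would be separate files passed via
# -mapper / -reducer; both talk to the framework only through stdin/stdout.
import sys


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word of input."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines sharing a key.

    Streaming guarantees the reducer sees its input sorted by key,
    so a single pass with a running total is enough."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield "%s\t%d" % (current, total)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as "script.py map" or "script.py reduce" over stdin.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Locally you can simulate a streaming run with `cat input.txt | python3 script.py map | sort | python3 script.py reduce`, which mirrors the sort the framework performs between the map and reduce phases.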

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
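The max_temperature scripts referenced above parse NCDC weather records. As a simplified illustration of the reducer side of such a job (an assumed stand-in, not the actual script), the following takes already-sorted `year<TAB>temperature` lines and prints the maximum temperature per year:

```python
#!/usr/bin/env python3
# Simplified stand-in for a max-temperature streaming reducer (an assumption,
# not the actual max_temperature_reduce.py). Input lines must already be
# sorted by key, which Hadoop Streaming guarantees for reducer input.
import sys


def max_per_key(sorted_lines):
    """For sorted 'year<TAB>temp' lines, yield 'year<TAB>max_temp' per year."""
    last_key, max_val = None, None
    for line in sorted_lines:
        key, val_str = line.rstrip("\n").split("\t")
        val = int(val_str)
        if key != last_key and last_key is not None:
            # Key changed: the previous year's group is complete.
            yield "%s\t%d" % (last_key, max_val)
            max_val = None
        last_key = key
        max_val = val if max_val is None else max(max_val, val)
    if last_key is not None:
        yield "%s\t%d" % (last_key, max_val)


if __name__ == "__main__" and len(sys.argv) > 1:
    for out in max_per_key(sys.stdin):
        print(out)
```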
871 questions
0 votes, 1 answer

How do I set up a distributed map-reduce job using hadoop streaming and ruby mappers/reducers?

I'm able to run a local mapper and reducer built using Ruby with an input file. I'm unclear about the behavior of the distributed system, though. For the production system, I have an HDFS set up across two machines. I know that if I store a large file…
Nikhil
0 votes, 1 answer

Hadoop Streaming task failure

I have a relatively simple program written in C++ and I have been using Hadoop Streaming for MapReduce jobs (my version of Hadoop is Cloudera). Recently, I found that a lot of streaming tasks keep failing and being restarted by the task tracker while…
ablimit
0 votes, 1 answer

File split/partition in hadoop

In the Hadoop filesystem, I have two files, say X and Y. Normally, Hadoop makes 64 MB chunks of files X and Y. Is it possible to force Hadoop to divide the two files so that a 64 MB chunk is created from 32 MB of X and 32 MB of Y? In…
justin waugh
-1 votes, 1 answer

Read image in hadoop

How do I convert an image to SequenceFile format in Hadoop? I don't want to read a bunch of files, just a single image, and manipulate it.
-1 votes, 1 answer

How to read TXT files from multiple cloud storage buckets in Spark?

I want to list all the buckets from cloud storage which match gs://bucketname*. I have tried using gsutil, which works, but the same is not working from Spark read or readStream. gs://bucket1 gs://bucket2 gs://bucketN working: gsutil ls…
-1 votes, 2 answers

MapReduce Reducer of 2 Keys - Python

This should be pretty simple and I have put a few hours into this. Example Data (name, binary, count): Adam 0 1 Adam 1 1 Adam 0 1 Mike 1 1 Mike 0 1 Mike 1 1 Desired Example Output (name, binary, count): Adam 0 2 Adam 1 1 Mike 0 1 Mike 1 2 …
CP3
-1 votes, 3 answers

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1, worked perfectly on local

I have googled this error on each and every forum but no luck. I have got the error written below: 18/08/29 00:24:53 INFO mapreduce.Job: map 0% reduce 0% 18/08/29 00:24:59 INFO mapreduce.Job: Task Id : attempt_1535105716146_0226_m_000000_0, Status…
anshita
-1 votes, 1 answer

Error in R-Hive streaming

I am using Hive and R in order to score my machine learning model on a large dataset. However, the code is giving the following error. I have tested the R script separately on my local machine and ensured that it is error-free. Can somebody…
-1 votes, 1 answer

can we use log4j in mapreduce?

Can we use log4j to log in MapReduce? If so, provide the steps to use log4j in map-reduce to log the information. I have written the log4j.properties below, but nothing was logged.
-1 votes, 1 answer

Get all tweets based on SPECIFIC word and STORE all tweets in SINGLE BAG

I am trying to process the sample tweet and store the tweets based on the filtered criteria. For example, sample tweet:- {"created_time": "18:47:31 ", "text": "RT @Joey7Barton: ..give a word about whether the americans wins a Ryder cup. I mean…
Mohan.V
-1 votes, 4 answers

Runtime Error in the Max temperature Mapreduce java code

I am running a mapreduce code, an error I am getting is Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable at test.temp$Mymapper.map(temp.java:1) at…
harsh mehta
-1 votes, 1 answer

Need to select a string between two known characters in Hadoop Hive

How do I check for the keyword in a string that occurs after the 2nd % and before the 3rd % in Hadoop Hive? For example, given products%apple products%security%firewalls%adaptive security appliances (asa)%, the returned keyword should be "security".
Hemaraj ku
-1 votes, 2 answers

HDInsight - Azure blob storage

I have some basic questions about Azure HDInsight. The following article gives some basic input on using HDInsight: https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/. It says that HDInsight…
-1 votes, 1 answer

Convert Hadoop job to Spark

I am trying to migrate the following Hadoop job to Spark. public class TextToSequenceJob { public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException { Job job =…
Edamame
-1 votes, 1 answer

How to implement a map-reduce job which will split a file into smaller sub-files so that they can be read in memory

I am trying to write a map-reduce job in Python. The first mapper will split the files into multiple subfiles and the reducer will do some manipulation on those files and combine them. How do I split the files randomly in Python in…
Bg1850