
I have written mapper and reducer programs in R and am using the Hadoop Streaming utility to run them on Hadoop. My constraint is that I need to feed two text files into the mapper program. How can I achieve this? Kindly assist at the earliest.

For a single input, I place the input file in HDFS and read it from stdin. But how do I achieve this for multiple input files?

user2500875

2 Answers


If you specify multiple input files, they are all streamed through stdin, and the order of records is arbitrary. To figure out which file you are actually reading at any given time, you can call Sys.getenv("map_input_file").
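To make this concrete, here is a minimal streaming-mapper sketch in Python (the tutorial in the other answer is Python-based; in R the same lookup is Sys.getenv("map_input_file")). The file names and the tagging scheme are hypothetical; the point is only that the current split's file name arrives via an environment variable, which lets you tag each record with its source before it reaches the reducer:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper that tags each record with the
# input file it came from, so a downstream reducer can tell the two
# inputs apart. Hadoop Streaming exposes the current split's file path
# as the environment variable "map_input_file" (newer releases use
# "mapreduce_map_input_file").
import os
import sys

def tag_record(line, input_file):
    """Prefix a record with a tag derived from its source file name."""
    tag = os.path.basename(input_file) if input_file else "unknown"
    return "%s\t%s" % (tag, line.rstrip("\n"))

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Check both the old and new variable names.
    input_file = os.environ.get(
        "map_input_file", os.environ.get("mapreduce_map_input_file", ""))
    for line in stdin:
        stdout.write(tag_record(line, input_file) + "\n")

if __name__ == "__main__":
    main()
```

The reducer can then split each record on the tab and branch on the tag, which is the usual way to join two inputs in a streaming job.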

piccolbo
  • If you are doing a join you could use the rmr2 package where these gory details are taken care of for you. – piccolbo Aug 13 '13 at 18:32

This is a great tutorial on how to use Hadoop Streaming with Python. The example in that tutorial reads 3 books (in your case, 2 files) from a directory by doing something like this:

hduser@ubuntu:/usr/local/hadoop$ 
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

The

-input /user/hduser/gutenberg/* 

will read all the files in that HDFS directory and process them.
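If your two files do not share a directory, you can also repeat the -input option, since the streaming jar accepts it multiple times. A sketch of such an invocation (the paths and R script names here are hypothetical, adapted to your R mapper/reducer):

```shell
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
  -file /home/hduser/mapper.R    -mapper /home/hduser/mapper.R \
  -file /home/hduser/reducer.R   -reducer /home/hduser/reducer.R \
  -input /user/hduser/input/file1.txt \
  -input /user/hduser/input/file2.txt \
  -output /user/hduser/output
```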

Hope this solves your problem.

B.Mr.W.