I have a single mapper:

    import sys

    for line in sys.stdin:
        # if the line is from file1:
        #     process it based on some_arbitrary_logic
        #     emit k, v
        # if the line is from file2:
        #     process it based on another_arbitrary_logic
        #     emit k, v
        pass  # placeholder so the loop body is valid Python
I need to call this mapper through the Hadoop Streaming API with -input file1 and another -input file2.
How do I achieve this? How do I know which line belongs to which file in the STDIN that Hadoop Streaming gives me?
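One possible approach, sketched here under assumptions: Hadoop Streaming exports job configuration properties to the mapper process as environment variables, with dots replaced by underscores, so the path of the file behind the current split should be readable from `mapreduce_map_input_file` (or `map_input_file` on older releases). The helper names `current_input_path`, `source_tag`, and `tag_lines` are hypothetical, and matching on the substrings "file1" and "file2" is only an illustration of how paths might be told apart.

```python
import os


def current_input_path(environ=os.environ):
    """Return the path of the input file for this map task, or '' if unknown."""
    # Assumption: newer Hadoop releases set mapreduce_map_input_file,
    # older ones set map_input_file.
    return environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "")


def source_tag(path):
    """Hypothetical rule: classify a path by the input name it contains."""
    if "file1" in path:
        return "file1"
    if "file2" in path:
        return "file2"
    return "unknown"


def tag_lines(lines, environ=os.environ):
    """Yield (tag, line) pairs; the tag is computed once per map task."""
    tag = source_tag(current_input_path(environ))
    for line in lines:
        yield tag, line.rstrip("\n")


# As a streaming mapper script, one could finish with something like:
#     import sys
#     for tag, line in tag_lines(sys.stdin):
#         ...  # branch on tag and emit k, v
```

With the usual file-based input formats each map task reads a split of a single file, so looking the path up once per task should be enough. This is worth verifying on the cluster in question, for example by writing the variable to stderr from a test job.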
UPDATE
File1
Fruit, Vendor, Cost
Oranges, FreshOrangesCompany, 50
Apples, FreshAppleCompany, 100
File2
Vendor, Location, NumberOfOffices
FreshAppleCompany, NewZealand, 45
FreshOrangeCompany, FijiIslands, 100
What I need to do is print out how many offices sell oranges:

    Oranges 100

So both files need to be input to the mapper.
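To make the goal concrete, here is a minimal sketch of how such a join could be phrased, assuming a reduce-side join keyed on the vendor name. The function names `map_record` and `join_vendor`, the `FRUIT`/`OFFICES` tags, and the header-skipping rule are all hypothetical; the `source` argument stands for whatever mechanism identifies the originating file.

```python
def map_record(line, source):
    """Turn one comma-separated line into (vendor, tagged_value).

    Returns None for header rows, malformed lines, or unknown sources.
    `source` is "file1" or "file2"; determining it is left to the caller.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 3:
        return None
    if source == "file1":
        fruit, vendor, _cost = fields
        if fruit == "Fruit":  # skip the header row of File1
            return None
        return vendor, "FRUIT\t" + fruit
    if source == "file2":
        vendor, _location, offices = fields
        if vendor == "Vendor":  # skip the header row of File2
            return None
        return vendor, "OFFICES\t" + offices
    return None


def join_vendor(values):
    """Given all tagged values for one vendor, pair each fruit with each office count."""
    fruits = [v.split("\t", 1)[1] for v in values if v.startswith("FRUIT\t")]
    offices = [v.split("\t", 1)[1] for v in values if v.startswith("OFFICES\t")]
    return [(fruit, count) for fruit in fruits for count in offices]
```

For example, `map_record("Apples, FreshAppleCompany, 100", "file1")` gives the key `FreshAppleCompany` with the value `FRUIT\tApples`, and the reducer side would pair it with the `OFFICES\t45` record from File2. The join relies on the vendor string being spelled identically in both files.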