I have two files in my cluster, File A and File B, with the following data:
File A
#Format:
#Food Item | Is_A_Fruit (BOOL)
Orange | Yes
Pineapple | Yes
Cucumber | No
Carrot | No
Mango | Yes
File B
#Format:
#Food Item | Vendor Name
Orange | Vendor A
Pineapple | Vendor B
Cucumber | Vendor B
Carrot | Vendor B
Mango | Vendor A
Basically, I want to find out how many fruits each vendor is selling.
Expected output:
Vendor A | 2
Vendor B | 1
I need to do this using Hadoop Streaming with Python MapReduce.
I know how to do a basic word count: the mapper reads from sys.stdin
and emits (key, value) pairs for the reducer to then aggregate.
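For reference, the word-count pattern I'm familiar with looks roughly like this (a minimal sketch, not my actual job):

```python
import sys


def map_line(line):
    """Word-count-style mapper: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in line.strip().split()]


def run_mapper(lines):
    """Consume input lines and print tab-separated key-value pairs."""
    for line in lines:
        for key, value in map_line(line):
            print(f"{key}\t{value}")


if __name__ == "__main__":
    # Hadoop Streaming feeds input splits to the mapper on stdin.
    run_mapper(sys.stdin)
```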
How do I approach this problem?
My main concern is how to read from multiple files and then compare them in Hadoop Streaming.
I can do this in plain Python (i.e. without MapReduce and Hadoop) easily, but that's infeasible given the sheer size of the data I have.
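To show what I mean, the plain-Python version is just a dictionary lookup joining the two files (sketched here with the files passed in as line iterables, since the real data lives in HDFS):

```python
def count_fruits_per_vendor(file_a_lines, file_b_lines):
    """Join File A (item -> is-a-fruit) with File B (item -> vendor)
    and count how many fruit items each vendor sells."""
    # Parse File A into a lookup table: food item -> True if it is a fruit.
    is_fruit = {}
    for line in file_a_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the format comment and blank lines
        item, flag = (part.strip() for part in line.split("|"))
        is_fruit[item] = (flag == "Yes")

    # Parse File B and count only the items flagged as fruit.
    counts = {}
    for line in file_b_lines:
        if line.startswith("#") or not line.strip():
            continue
        item, vendor = (part.strip() for part in line.split("|"))
        if is_fruit.get(item):
            counts[vendor] = counts.get(vendor, 0) + 1
    return counts
```

On the sample data above this returns {"Vendor A": 2, "Vendor B": 1} in memory, which is exactly the join-then-count I don't know how to express across two input files in Hadoop Streaming.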