I am a total Hadoop n00b, and I am trying to solve the following as my first Hadoop project. I have a million+ sub-folders sitting in an Amazon S3 bucket. Each of these folders has two files. File1 has data in the following format:
date,purchaseItem,purchaseAmount
01/01/2012,Car,12000
01/02/2012,Coffee,4
....................
File2 has the customer's information in the following format:
ClientId:Id1
ClientName:"SomeName"
ClientAge:"SomeAge"
This same pattern is repeated across all the folders in the bucket.
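To pin down what I mean by the File2 format, this is roughly how I would turn its Key:Value lines into a single CSV prefix in plain Java (the class and method names and the fixed key order are just my own sketch, nothing Hadoop-specific):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ClientInfoParser {

        // Turns File2's three "Key:Value" lines into a CSV prefix
        // like: Id1,"SomeName","SomeAge"
        static String buildClientPrefix(List<String> file2Lines) {
            Map<String, String> fields = new HashMap<>();
            for (String line : file2Lines) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    fields.put(line.substring(0, colon).trim(),
                               line.substring(colon + 1).trim());
                }
            }
            return fields.get("ClientId") + ","
                 + fields.get("ClientName") + ","
                 + fields.get("ClientAge");
        }
    }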
Before I write all this data into HDFS, I want to join File1 and File2 as follows:
Joined File:
ClientId,ClientName,ClientAge,date,purchaseItem,purchaseAmount
Id1,"SomeName","SomeAge",01/01/2012,Car,12000
Id1,"SomeName","SomeAge",01/02/2012,Coffee,4
I need to do this for each and every folder and then feed the joined dataset into HDFS. Can somebody point out how I would be able to achieve something like this in Hadoop? A push in the right direction would be much appreciated.