I am a total Hadoop n00b, and I am trying to solve the following as my first Hadoop project. I have a million+ sub-folders sitting in an Amazon S3 bucket. Each of these folders has two files. File1 has data in the following format:
date,purchaseItem,purchaseAmount
01/01/2012,Car,12000
01/02/2012,Coffee,4
....................
File2 has the customer's information in the following format:
ClientId:Id1
ClientName:"SomeName"
ClientAge:"SomeAge"
This same pattern is repeated across all the folders in the bucket.
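To pin down what I mean by the File2 format, this is roughly how I would turn its Key:Value lines into a single CSV prefix in plain Java (the class and method names and the fixed key order are just my own sketch, nothing Hadoop-specific):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ClientInfoParser {

        // Turns File2's three "Key:Value" lines into a CSV prefix
        // like: Id1,"SomeName","SomeAge"
        static String buildClientPrefix(List<String> file2Lines) {
            Map<String, String> fields = new HashMap<>();
            for (String line : file2Lines) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    fields.put(line.substring(0, colon).trim(),
                               line.substring(colon + 1).trim());
                }
            }
            return fields.get("ClientId") + ","
                 + fields.get("ClientName") + ","
                 + fields.get("ClientAge");
        }
    }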
Before I write all this data into HDFS, I want to join File1 and File2 as follows:
Joined File:
ClientId,ClientName,ClientAge,date,purchaseItem,purchaseAmount
Id1,"SomeName","SomeAge",01/01/2012,Car,12000
Id1,"SomeName","SomeAge",01/02/2012,Coffee,4
I need to do this for each and every folder and then feed the joined dataset into HDFS. Can somebody point out how I would be able to achieve something like this in Hadoop? A push in the right direction would be much appreciated.