Merging two files in Hadoop

Question

I am new to the Hadoop framework, so it would help if someone could guide me through this. I have two types of files:

dirA/ --> file_a, file_b, file_c

dirB/ --> another_file_a, another_file_b...

Files in directory A contain transaction information.

So something like:

   id, time_stamp
   1 , some_time_stamp
   2 , some_another_time_stamp
   1  , another_time_stamp

This kind of information is scattered across all the files in dirA. The first thing I want to do is: given a time frame (let's say last week), find all the unique ids that appear within that time frame, and save them to a file.

Now, the dirB files contain the address information. Something like:

    id, address, zip code
     1, fooadd, 12345
     and so on

I then take all the unique ids output by the first step as input and look up the address and zip code for each.

Basically, the final output is like a SQL merge.

Find all the unique ids within a time frame and then merge in the address information.

I would greatly appreciate any help. Thanks

score 1 · Accepted Answer · answered Sep 25 '12 at 18:22


You tagged this as pig, so I'm guessing you want to use it to accomplish this? If so, I think that's a great choice: this is really easy in Pig!

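    -- PigStorage expects a single-character delimiter, so split on ',' alone.
    -- If the files have spaces around the commas, the fields may need trimming before casting.
    times = LOAD 'dirA' USING PigStorage(',') AS (id:int, time:long);
    addresses = LOAD 'dirB' USING PigStorage(',') AS (id:int, address:chararray, zipcode:chararray);
    -- Keep only the rows inside the requested time window.
    filtered_times = FILTER times BY (time >= $START_TIME) AND (time <= $END_TIME);
    -- Reduce to one row per id.
    just_ids = FOREACH filtered_times GENERATE id;
    distinct_ids = DISTINCT just_ids;
    -- Attach the address and zip code to each id.
    result = JOIN distinct_ids BY id, addresses BY id;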

Here `$START_TIME` and `$END_TIME` are parameters you can pass to the script.
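As a rough sketch of how the parameters might be supplied: Pig's parameter substitution lets a script declare fallback values with `%default`, and values given on the command line with `-param` override them. The script name `merge.pig` and the epoch-second values below are assumptions for illustration only; the question does not say what unit the timestamps use.

```pig
-- Invoke as (script name is hypothetical):
--   pig -param START_TIME=1327622400 -param END_TIME=1332374400 merge.pig
-- The %default values apply only when a parameter is not passed.
-- Values are illustrative and assume epoch seconds (2012-01-27 and 2012-03-22, UTC).
%default START_TIME 1327622400
%default END_TIME 1332374400

times = LOAD 'dirA' USING PigStorage(',') AS (id:int, time:long);
-- The two parameters are substituted textually before the script runs.
filtered_times = FILTER times BY (time >= $START_TIME) AND (time <= $END_TIME);
```

Because substitution is textual, the values passed in need to be valid Pig literals of the same type and unit as the `time` column.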


Joe K


`store result into 'hdfs://host/path/filename';` – Lorand Bendig Sep 25 '12 at 19:20
Hi. If the start date is 2012-01-27 and the end date is 2012-03-22, what should the query look like? In the file, the dates are in the 2012-02-12 format. – frazman Sep 25 '12 at 21:52
For parsing dates, you should write a UDF, or look for a pre-existing one that does this. Here's more info: [UDF Manual](http://wiki.apache.org/pig/UDFManual). Convert the date to a long so that Pig can do the comparison. – Joe K Sep 25 '12 at 22:20
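One possible alternative to a UDF, sketched here under a stated assumption: if every date in the files is zero-padded `yyyy-MM-dd`, then plain string order matches date order, so the column can be loaded as a `chararray` and compared directly. The parameter names `START_DATE` and `END_DATE` and the field name `date_str` are hypothetical, and this is not the approach the comment above describes.

```pig
-- Assumption: every date is zero-padded yyyy-MM-dd, so string order equals date order.
-- Parameter and field names are made up for this sketch.
%default START_DATE '2012-01-27'
%default END_DATE '2012-03-22'

raw_times = LOAD 'dirA' USING PigStorage(',') AS (id:chararray, date_str:chararray);
-- TRIM strips the spaces that surround the commas in the sample data.
times = FOREACH raw_times GENERATE (int)TRIM(id) AS id, TRIM(date_str) AS date_str;
-- The quotes around each parameter make the substituted value a chararray literal.
filtered_times = FILTER times BY (date_str >= '$START_DATE') AND (date_str <= '$END_DATE');
just_ids = FOREACH filtered_times GENERATE id;
distinct_ids = DISTINCT just_ids;
```

If the dates are not consistently zero-padded, or carry a time-of-day component in varying formats, converting them to a number as suggested above is the safer route.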
