
We are using pig-0.11.0-cdh4.3.0 with a CDH4 cluster and we need to de-duplicate some web logs. The solution idea (expressed in SQL) is something like this:

SELECT
     T1.browser,
     T1.click_type,
     T1.referrer,
     T1.datetime,
     T2.datetime
FROM
     My_Table T1
INNER JOIN My_Table T2 ON
     T2.browser = T1.browser AND
     T2.click_type = T1.click_type AND
     T2.referrer = T1.referrer AND
     T2.datetime > T1.datetime AND
     T2.datetime <= DATEADD(mi, 1, T1.datetime)

I grabbed the above from the question "SQL find duplicate records occurring within 1 minute of each other". I am hoping I can implement a similar solution in Pig, but I am finding that Pig apparently does not support JOIN via an expression (only by fields), as the above join requires. Do you know how to de-duplicate events that fall within 1 minute of each other with Pig? Thanks!

Marquez
  • Which version of pig are you using? – alexeipab Jul 16 '13 at 13:46
  • What is the format for the data time? Could you provide examples? – alexeipab Jul 16 '13 at 13:47
  • We are using pig-0.11.0-cdh4.3.0 with a CDH4 cluster. Regarding the date format I already have code that parses the logs and extract the date (using Pigs ToDate function) into a DateTime value. About the relation schema, say for a moment the tuples have: browser, refererrer, click_type and datetime and sometimes there are dups with the same browser, click_type and referrer within 1 minute time difference. – Marquez Jul 16 '13 at 14:04
  • Would you be happy with one unique record per minute even though the difference between the 2 events is less than 60 seconds? – alexeipab Jul 16 '13 at 14:17
  • If I understand you correctly yes. That is exactly what I am looking for. To be more precise: I would like the keep the most recent of the 2 and throw away the older one. Said in other words: if any 2 events (that match on browser, click_type and referrer) are within 60 seconds of each other then filter out the older one and and keep the most recent one. – Marquez Jul 16 '13 at 14:31
  • If you have following timestamps in mm:ss format for the same group (01:00, 01:01, 01:04, 02:01, 02:56), which ones do you want to leave? Is it (02:56, 01:04)? – alexeipab Jul 16 '13 at 15:55

4 Answers


Off the top of my head, something like this could work, but it needs testing:

view = FOREACH input GENERATE browser, click_type, referrer, datetime, GetYear(datetime) as year, GetMonth(datetime) as month, GetDay(datetime) as day, GetHour(datetime) as hour, GetMinute(datetime) as minute;
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
    ordered = ORDER view BY datetime DESC; -- keep the most recent event per minute
    top = LIMIT ordered 1;
    GENERATE FLATTEN(top.(browser, click_type, referrer, datetime));
};

Of course, here if one event is at 12:03:45 and another at 12:03:59, they would fall in the same group, while 12:04:45 and 12:05:00 would fall in different groups even though they are only 15 seconds apart.

To enforce an exact 60-second difference you would need to write a UDF which would iterate over a sorted bag grouped on (browser, click_type, referrer) and remove the unwanted rows.
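For illustration, the sliding-window logic such a UDF would apply to each sorted bag can be sketched in Python (Pig can register Python/Jython UDFs, but the function name and the plain list-of-seconds input here are my own invention, not working UDF code):

```python
def dedupe_within_window(timestamps, window=60):
    """Walk events from newest to oldest; keep an event only if it is
    more than `window` seconds older than the last event we kept."""
    kept = []
    for ts in sorted(timestamps, reverse=True):
        if not kept or kept[-1] - ts > window:
            kept.append(ts)
    return kept

# The mm:ss example from the comments (01:00, 01:01, 01:04, 02:01, 02:56),
# expressed in seconds:
print(dedupe_within_window([60, 61, 64, 121, 176]))  # [176, 64], i.e. (02:56, 01:04)
```

Comparing each event against the last *kept* event is what keeps 01:04 (more than 60 seconds older than 02:56) while dropping 01:00, 01:01 and 02:01, matching the expected result discussed in the comments.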

alexeipab

One approach is to group by the required parameters and then filter and limit inside a nested FOREACH:

         grpd = group records by (browser, click_type, referrer);
         top3 = foreach grpd {
            sorted = filter records by time < 60;
            top    = limit sorted 2;
            generate group, flatten(top);
         };
Nag
  • How would you calculate the "time" to do the "filter records by time < 60"? – alexeipab Jul 16 '13 at 14:46
  • Marquez already extract the date (using Pigs ToDate function) into a DateTime value. find the MAX time for the group and do the difference and check for < 60 sec – Nag Jul 16 '13 at 14:51
  • How exactly would you do the difference between the rows of the same bag? – alexeipab Jul 16 '13 at 14:57

Another approach:

   records_group = group records by (browser, click_type, referrer);

   with_min = FOREACH records_group
              GENERATE FLATTEN(records), MAX(records.datetime) as maxDt;

   filterRecords = filter with_min by (maxDt - $2) < 60;

$2 is the position of the datetime field; change it accordingly.

Nag
  • " filterRecords = filter with_min by (maxDt - $2 ) <60;" will leave all records that are within the 60 seconds of the latest datetime of the group. – alexeipab Jul 16 '13 at 15:06
  • Also the events like (browser, click_type, referrer, 12:03:45) and (browser, click_type, referrer, 12:04:50) are not duplicates, where (browser, click_type, referrer, 12:04:51) is a duplicate.Having a single maxDt per group does not allow for a sliding window of 60 seconds. – alexeipab Jul 16 '13 at 15:08
  • Sorry Aleks, I am confused with your second point. I hope OP needs to filter all the records with the group (browser, click_type, referrer) event time < 60 sec. Please refer is sql OP trying to do the same, comparing with MAX dt will filter all the records with in the specified time limit. To be more precise: "I would like the keep the most recent of the 2 and throw away the older one. Said in other words: if any 2 events (that match on browser, click_type and referrer) are within 60 seconds of each other then filter out the older one and and keep the most recent one." – Nag Jul 16 '13 at 15:14
  • In SQL "T2.datetime > T1.datetime AND T2.datetime <= DATEADD(mi, 1, T1.datetime)" means from T1.datetime to 60 seconds from T1.datetime. The SQL finds the duplicated records but does not exclude them from the result, thus does not perform de-duplication, more left outer joins are required to finish the job. In pig "group by" works differently from the "join" in SQL. For example for the same group key we have following timestamps in format mm:ss (01:00, 01:01, 01:04, 02:01, 02:56). The Pig script you have provided will return only (02:01, 02:56) where the correct result is (02:56, 01:04)) – alexeipab Jul 16 '13 at 15:32
  • "records_group = group records by (browser, click_type, referrer);" does not include any information about time, so the group will contain records that can be hours apart. MAX(records.datetime) will take the latest time and "(maxDt - $2 ) <60;" will filter out all records from the final result that are older than 60 seconds, even those that are not duplicate and are hours apart. PIG works differently from SQL. – alexeipab Jul 16 '13 at 15:43
  • Aleks this is one from OP "Said in other words: if any 2 events (that match on browser, click_type and referrer) are within 60 seconds of each other then filter out the older one and and keep the most recent one."" he needs the latest 2 which is < 60 sec ie he expects only (02:01, 02:56). Please let me know if i missed some thing – Nag Jul 16 '13 at 15:43
  • Same (browser, click_type and referrer) group can have multiple windows of 60 seconds. It is easy to visualise if you draw a timeline and put timestamps on it and try to cover them with 60 second windows from the latest time. First 60 second widow covers (02:01, 02:56), the most recent one of these 2 events is 02:56, thus he does not need the 02:01. But (01:04, 02:56) are not in the same 60 second window, 01:04 is in the second 60sec window, so the rule of exclusion does not apply to it, thus 01:04 can stay in the result set and exclude the (01:00, 01:01). – alexeipab Jul 16 '13 at 15:49
  • The difficulty with PIG is without UDF you can not compare 2 records from the same bag, thus can not enforce the sliding 60 seconds window. – alexeipab Jul 16 '13 at 15:57
  • This is a clever and interesting approach however I am not sure it will work in my scenario and I apologize if the SQL example is an incorrect representation of my goal. I will update it shortly if that is the case. – Marquez Jul 16 '13 at 16:02
  • The scenario I believe will not work with this answer is as follows. Assume for a moment there are 3 identical events in these times: 00:00:00 00:00:45 00:02:00 the max will be 00:02:00 so it will be more than 60 seconds from the other two events. As a result the first event (at time 0) will not be filtered. The goal is to filter the first event because there is a dup 45 seconds after it (within 60 seconds) but the approach in this answer fails to do that I believe. – Marquez Jul 16 '13 at 16:03
  • Aleks and Marq , records_group = group records by (browser, click_type, referrer); with_min = FOREACH records_group GENERATE records, ABS(MAX(records.datetime) -records.datetime) as maxDtgroup ; regroup = group with_min by (browser, click_type, referrer, maxDtgroup); and filter the top 1 record from the regroup. – Nag Jul 16 '13 at 16:28
  • Yes, 2 "group by" will work. Just a small correction "MAX(records.datetime) -records.datetime" will not be allowed as records.datetime is a BAG. But if you rewrite it as the with_min as: with_min = FOREACH records_group GENERATE FLATTEN(records), MAX(records.datetime) as max ; with_min = FOREACH with_min GENERATE browser, click_type, referrer, ABS(max - dateime) as maxDtgroup; it will compile. The down side is the 2 Map Reduce Jobs that the pig will generate, but the upside is that no UDF code is needed. UDF could do it one Map Reduce job if the performance is a concern. Thanks Nag!!! – alexeipab Jul 16 '13 at 16:47
  • Alex, thanks for the inputs it is really a fun. But how did you conclude that PIG will do 2 MR ? – Nag Jul 16 '13 at 16:56
  • In Map Reduce, a "group by" is a separate Hadoop Job, as whatever columns are after "BY", form the key emitted by the Mapper. 2 x "group by" = 2 MR Jobs. Pig always tries to optimize and cause as few jobs as possible by using multi-query, but with a FOREACH logic in between the 2 of the "group by" statements it will not be able to do so. – alexeipab Jul 16 '13 at 17:07
  • Aleks, I have a similar pig script which has 2 GROUP BY and in between a FOREACH statement. Out put is ONE MR job from PIG runtime. I hope some how pig does the optimization. – Nag Jul 16 '13 at 17:36
  • Depends on the script, does it have a FLATTEN in between the 2 "group by"? Do you have the data flow like: INPUT -> GROUP BY 1 -> FLATTEN -> GROUP BY 2? – alexeipab Jul 16 '13 at 17:41
  • inpt = load '.../data.txt' as (i1, i2, d:int); grp = group inpt by (i1, i2); crs = foreach grp generate FLATTEN(inpt), MAX(inpt.d) as max; crs = foreach crs generate i1, i2, d, ABS(d - max); rs = group crs by (i1, i2); dump rs; causes the Job Stats (time in seconds): JobId Alias Feature Outputs job_local_0001 crs,grp,inpt GROUP_BY job_local_0002 rs GROUP_BY file:/tmp/temp1120531244/tmp-1455515639, – alexeipab Jul 16 '13 at 17:41
  • Aleks, yes it contains the Flatten in between the 2 groups. records_group = group records by (device); with_min = FOREACH records_group GENERATE FLATTEN(records) AS ( device:chararray, dt:int, metric:chararray, value:chararray), MIN(records.dt) as mindt; regroup = group with_min by (device,mindt); This script makes one MR JOB – Nag Jul 16 '13 at 17:48
  • Counter Map Reduce Total Launched reduce tasks 0 0 1 Launched map tasks 0 0 1 Data-local map tasks 0 0 1 – Nag Jul 16 '13 at 17:52
  • At the end of each PIG execution, there is a Job DAG and Job Status outputs . What are they? – alexeipab Jul 16 '13 at 17:58
  • JOB DAG is Job DAG: job_201307141618_0008 -> job_201307141618_0009, job_201307141618_0009 – Nag Jul 16 '13 at 18:00
  • So it is 2 Map Reduce Jobs. – alexeipab Jul 17 '13 at 06:52

Aleks and Marq ,

  records_group = group records by (browser, click_type, referrer);

  with_min = FOREACH records_group
           GENERATE FLATTEN(records), MAX(records.datetime) as max;

  with_min = FOREACH with_min GENERATE browser, click_type, referrer,
            datetime, ABS(max - datetime) as maxDtgroup;

  regroup = group with_min by (browser, click_type, referrer, maxDtgroup);

Re-group with maxDtgroup as part of the key and keep the top 1 record from each group.
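A sketch in Python of what the double grouping is aiming at (my own illustration; it assumes maxDtgroup is bucketed into whole 60-second windows, e.g. (max - datetime) / 60, rather than the raw difference, so each window measured back from the latest event forms one group):

```python
def keep_latest_per_window(timestamps, window=60):
    """Bucket each event by its distance from the group's latest event,
    in whole windows, then keep the most recent event per bucket."""
    latest = max(timestamps)
    buckets = {}
    for ts in timestamps:
        b = (latest - ts) // window  # which 60-second window back from `latest`
        buckets[b] = max(buckets.get(b, ts), ts)
    return sorted(buckets.values(), reverse=True)

# Same mm:ss example as in the comments: (01:00, 01:01, 01:04, 02:01, 02:56)
print(keep_latest_per_window([60, 61, 64, 121, 176]))  # [176, 64]
```

As alexeipab notes in the comments, the two GROUP BYs cost two Map Reduce jobs; a UDF could do the same work in a single job if performance is a concern.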

Nag