We are using pig-0.11.0-cdh4.3.0 on a CDH4 cluster and we need to de-duplicate some web logs. The idea, expressed in SQL, is something like this:
SELECT
T1.browser,
T1.click_type,
T1.referrer,
T1.datetime,
T2.datetime
FROM
My_Table T1
INNER JOIN My_Table T2 ON
T2.browser = T1.browser AND
T2.click_type = T1.click_type AND
T2.referrer = T1.referrer AND
T2.datetime > T1.datetime AND
T2.datetime <= DATEADD(mi, 1, T1.datetime)
I took the SQL above from the question "SQL find duplicate records occuring within 1 minute of each other". I was hoping to implement a similar solution in Pig, but apparently Pig's JOIN cannot take an arbitrary condition such as the datetime range above (it joins on field equality only). Does anyone know how to de-duplicate events that occur within 1 minute of each other in Pig? Thanks!
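
To make the intent concrete, here is a rough sketch of the per-group logic I have in mind. It is written in plain Python, since Pig can register Jython UDFs, and it is only an illustration under some assumptions: the function names are made up, the timestamps are assumed to be epoch seconds, and each call is assumed to receive the timestamps of a single (browser, click_type, referrer) group, for example after a GROUP BY on those three fields. The first function mirrors the SQL join condition; the second is one possible reading of "de-duplicate" (keep an event only if it is more than a minute after the last kept one).

```python
def near_duplicate_pairs(timestamps, window_seconds=60):
    """Return (t1, t2) pairs with t1 < t2 <= t1 + window_seconds.

    `timestamps` are epoch seconds (an assumption) for ONE
    (browser, click_type, referrer) group, which stands in for the
    equality part of the SQL join.
    """
    ordered = sorted(timestamps)
    pairs = []
    for i, t1 in enumerate(ordered):
        for t2 in ordered[i + 1:]:
            if t2 > t1 + window_seconds:
                break  # input is sorted, so later values are further away
            if t2 > t1:  # strict inequality, as in the SQL
                pairs.append((t1, t2))
    return pairs


def drop_near_duplicates(timestamps, window_seconds=60):
    """Keep an event only if it is more than window_seconds after the last kept one."""
    kept = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] > window_seconds:
            kept.append(t)
    return kept
```

For example, `drop_near_duplicates([0, 30, 60, 200])` keeps only `0` and `200` under this reading. Whether this should run as a UDF over a grouped bag or be replaced by a field-equality JOIN followed by a FILTER on the time difference is exactly the part I am unsure about.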