1

Suppose, I have the following data in PIG.

DUMP raw;
(2015-09-15T22:11:00.000-07:00,1)
(2015-09-15T22:12:00.000-07:00,2)
(2015-09-15T23:11:00.000-07:00,3)
(2015-09-16T21:02:00.000-07:00,4)
(2015-09-15T00:02:00.000-07:00,5)
(2015-09-17T08:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,1)
(2015-09-17T19:02:00.000-07:00,1)

DESCRIBE raw;
raw: {process_date: chararray,id: int}

A = GROUP raw BY id;
DESCRIBE A;
A: {group: int,raw: {(process_date: chararray,id: int)}}
DUMP A;

 (1,{(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
(2,{(2015-09-15T22:12:00.000-07:00,2)})
(3,{(2015-09-15T23:11:00.000-07:00,3)})
(4,{(2015-09-16T21:02:00.000-07:00,4)})
(5,{(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})


    B = FOREACH A {generate raw,MAX(raw.process_date) AS max_date;}
    DUMP B;
        ({(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)},2015-09-17T19:02:00.000-07:00)
({(2015-09-15T22:12:00.000-07:00,2)},2015-09-15T22:12:00.000-07:00)
({(2015-09-15T23:11:00.000-07:00,3)},2015-09-15T23:11:00.000-07:00)
({(2015-09-16T21:02:00.000-07:00,4)},2015-09-16T21:02:00.000-07:00)
({(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)},2015-09-17T09:02:00.000-07:00)

    DESCRIBE B;
    B: {raw: {(process_date: chararray,id: int)},max_date: chararray}

Now, I need to filter raw based on process_date eq max_date. I have tried the following:

C = FOREACH B {filtered = FILTER raw BY REGEX_EXTRACT(process_date,'(\\d{4}-\\d{2}-\\d{2})',1) eq REGEX_EXTRACT(max_date,'(\\d{4}-\\d{2}-\\d{2})',1)}, but its not working.

Is there any way to do such filtering? Basically, I need to filter the raw based on latest date. The exception which I get is:

Invalid field projection. Projected field [max_date] does not exist in schema: process_date:chararray,id:int

Expected output: Latest data corresponding to latest date (not time) for each id

({(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
({(2015-09-15T22:12:00.000-07:00,2)})
({(2015-09-15T23:11:00.000-07:00,3)})
({(2015-09-16T21:02:00.000-07:00,4)})
({(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})
Ravi
  • 55
  • 7
  • You already used C as a relation, try with another – Abhis Sep 21 '15 at 20:47
  • Its not this problem. I tried with other name as well, its not working, but I can use C as well because first time I used C in nested FOREACH which is not exposed to outside FOREACH. Even C can be used outside, but the problem is using a filter clause from some other relation. – Ravi Sep 21 '15 at 20:58
  • Can you show us your desired output? – Abhis Sep 21 '15 at 20:58
  • The desired output would be something like: ({(2015-09-17T00:02:00.000-07:00,5)}), As I said earlier, filtering raw based on max_date – Ravi Sep 21 '15 at 21:03
  • @Ravi : Is your objective to select the latest records based on date ? – Murali Rao Sep 21 '15 at 21:55
  • Yes, I need to get the latest records based on whatever the latest date is.... – Ravi Sep 21 '15 at 22:05
  • One prerequisite is that the data would be grouped.....There would be many groups of data (like there ia only 1 group above)....From these groups, I need to take the latest date data.. – Ravi Sep 21 '15 at 22:09
  • @Ravi : This record is incorrect : (2015-09-16T24:02:00.000-07:00,4) hour is 24 ? – Murali Rao Sep 21 '15 at 23:31
  • @Ravi : Have added a snippet, try and let me know in case of any issues. – Murali Rao Sep 21 '15 at 23:37
  • @Ravi : I dont understand why in output for cid : 1 and 5 you are having two records, If your objective is to have the latest record ? – Murali Rao Sep 22 '15 at 01:55

0 Answers0