0

I'm new to programming in PIG and I have a relation with multiple fields (I'm simplifying the schema in this example below). I'm doing some calculations multiple times, and at the end I'm trying to join the results. But I get no results, and if I run a describe the schema seems to be correct. Also, when looking at the syntax check the only thing that catches my eyes is this warning: WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY.

Input

(123,1,-52.39,-1,2006-05-15)
(123,1,-52.39,-1,2007-04-04)
(123,2,-55.15,-1,2006-05-15)
(123,3,-49.64,-1,2006-05-15)
(123,4,52.39,1,2006-05-15)
(123,4,-52.39,-1,2007-04-04)
(123,4,52.39,1,2007-04-04)
(123,4,-52.39,-1,2007-04-09)
(123,5,86.86,1,2007-04-04)
(123,5,-86.86,-1,2007-04-09)

Desired output:

(123,1,-104.78,-2,2007-04-04)
(123,2,-55.15,-1,2006-05-15)
(123,3,-49.64,-1,2006-05-15)
(123,4,0,0,2007-04-09)
(123,5,0,0,2007-04-09)
c1 = load 'file.csv' using PigStorage(',') as (ID, LN, PAY_AMT:double,UNIT_QTY:int, PD_DT);
c2 = FOREACH c1 GENERATE ID, LN, PAY_AMT, UNIT_QTY;
c3 = group c2 by (ID, LN);
c3agg = FOREACH c3 GENERATE FLATTEN(group) as (ID,LN),
      SUM(c2.PAY_AMT) as PdAmt, SUM(c2.UNIT_QTY) as Unit_qty;

describe c3agg;

c3agg: {ID: bytearray,LN: bytearray,PdAmt: double,Unit_qty: long}

So now I'm trying to get the MAX(PD_DT) since using the actual MAX operator doesn't work (or at least I can't figure it out to work without using this code below).

c4 = foreach c1 generate ID, LN, PD_DT;
c5 = group c4 by (ID, LN);
c3dt = FOREACH c5 {                 -- get MAX(PD_DT), 
    c5ord = ORDER c4 by PD_DT DESC;
    c5lmt = LIMIT c5ord 1;
    GENERATE FLATTEN(c5lmt);};

describe c3dt;

c3dt: {c5lmt::ID: bytearray,c5lmt::LN: bytearray,c5lmt::PD_DT:bytearray}

Now trying for the join, which doesn't return anything:

cj = JOIN c3agg BY (ID, LN), c3dt BY (ID, LN);
dump cj;

I tried using field position but with the same result. cj = join c3agg by ($0, $1), c3dt BY ($0, $1);

describe cj;
cj: {c3agg::ID: bytearray,c3agg::LN: bytearray,c3agg::PdAmt: double,c3agg::Unit_qty: long,c3dt::c5lmt::ID: bytearray,c3dt::c5lmt::LN: bytearray,c3dt::c5lmt::PD_DT: bytearray}

Also, I tried defining the field type , for example ID:chararray and LN:int, but still no results. I really can't figure it out what am I doing wrong?

Thank you!

  • why do you say the MAX operator doesnt work and can you show the code that you tried with MAX. I presume if we can get that to work then we will solve a big chunk of yourproblem. – robthewolf Mar 30 '14 at 11:40
  • I tried getting the MAX(PD_DT) in c3agg but I get this Err: ERROR 2106: Error executing an algebraic function. `c2 = FOREACH c1 GENERATE ID, LN, PAY_AMT, UNIT_QTY, PD_DT; c3 = group c2 by (ID, LN); c3agg = FOREACH c3 GENERATE FLATTEN(group) as (ID,LN), SUM(c2.PAY_AMT) as PdAmt, SUM(c2.UNIT_QTY) as Unit_qty, MAX(PD_DT) as PdDt;` I assume because of the data type for PD_DT this doesn't work. – mimiBlue Apr 01 '14 at 18:26
  • Ok, I just answered myself with the data type.. I needed to declare PD_DT:chararray and now MAX works.. Thank you @robthewolf for bringing that up! On to the next phase of the calculations.. – mimiBlue Apr 01 '14 at 18:50

0 Answers0