I am trying to sort the tuples inside a bag on three fields, in descending order.

Example: suppose I have the following bag, created by grouping:

{(s,3,my),(w,7,pr),(q,2,je)}

I want to sort the tuples in the grouped bag above on fields $0, $1, and $2: sort on $0 first and pick the tuple with the largest $0 value. If $0 is the same for all the tuples, sort on $1, and so on.

The sort should be applied to every grouped bag.

Suppose we have a databag like:

{(21,25,34),(21,28,64),(21,25,52)}

Then, according to the requirement, the output should be:

{(21,25,34),(21,25,52),(21,28,64)}

Please let me know if you need any more clarification.

Evaldas Buinauskas
USY
  • So what should your output look like? – Vignesh I Oct 19 '15 at 14:41
  • The required output for the above databag would be {(q,2,je),(s,3,my),(w,7,pr)}. But suppose we have a databag like {(21,25,34),(21,28,64),(21,25,52)}; then, according to the requirement, the output should be {(21,25,34),(21,25,52),(21,28,64)}. Please let me know if you need any more clarification. – USY Oct 19 '15 at 15:05
  • Added expected output from comment to question – Evaldas Buinauskas Oct 21 '15 at 11:16

1 Answer

Order the tuples inside a nested FOREACH block.

Input:

(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)


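-- Fields are loaded as chararray, so ORDER compares them as strings.
-- Declare numeric fields as int if you need numeric ordering.
A = LOAD 'file' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
-- Sort the tuples inside each bag on b, then c, then d (ascending by default).
D = FOREACH C {
    od = ORDER A BY b, c, d;
    GENERATE od;
};
DUMP D;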

Result of DUMP C (which resembles your data):

({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})

Output:

({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})

Because the ORDER runs inside the FOREACH, it is applied to every grouped bag.
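
To make the ordering concrete, here is a minimal Python model of what a per-bag sort on three fields does. This is a sketch, not Pig code: `order_bag` and the `bags` dictionary are hypothetical names, and it assumes that chararray ordering in Pig behaves like plain string comparison.

```python
def order_bag(bag, descending=False):
    """Sort tuples on field 0, then field 1, then field 2.

    Python compares tuples field by field, which mirrors a
    multi-key ORDER BY on the fields of each tuple.
    """
    return sorted(bag, reverse=descending)


# Hypothetical grouped bags, keyed by group (the data comes from the question).
bags = {
    "g1": [("s", "3", "my"), ("w", "7", "pr"), ("q", "2", "je")],
    "g2": [(21, 25, 34), (21, 28, 64), (21, 25, 52)],
}

# Apply the same sort to every bag, as the nested FOREACH does.
ordered = {key: order_bag(bag) for key, bag in bags.items()}
print(ordered["g1"])
print(ordered["g2"])

# Pitfall: string fields compare character by character,
# so "10" sorts before "9"; integer fields compare numerically.
print(order_bag([("9",), ("10",)]))
print(order_bag([(9,), (10,)]))
```

Passing `descending=True` reverses the order on all three fields at once, which corresponds to putting DESC on every key.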

To generate only the tuple with the highest value:

A = LOAD 'file' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
    od = ORDER A BY b DESC, c DESC, d DESC;
    od1 = LIMIT od 1;
    GENERATE od1;
};
DUMP D;

To generate the tuple with the highest value when the tuples differ, but return all the tuples when they are all identical or when field 1 and field 2 are the same:

-- Note: Pig Latin comments use "--" (or /* */); "//" is not valid comment syntax.
A = LOAD 'file' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; -- RANK adds rank_C, used to tell the bags apart when tuples are the same
R = FOREACH F {
    dis = DISTINCT A;
    GENERATE rank_C, COUNT(dis) AS (cnt:long), A;
};
R3 = FILTER R BY cnt != 1; -- drop the bags in which all the tuples are the same
R4 = FOREACH R3 {
    fil1 = ORDER A BY b DESC, c DESC, d DESC;
    fil2 = LIMIT fil1 1;
    GENERATE rank_C, fil2;
}; -- largest tuple, except where all the tuples are the same
R5 = FILTER R BY cnt == 1; -- only the bags in which all the tuples are the same
R6 = FOREACH R5 GENERATE A; -- generate the required fields
F1 = FOREACH F GENERATE rank_C, FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1 and field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long), F1; -- count = 2 means two tuples share field 1 and field 2
F4 = FILTER F3 BY cnt1 == 2; -- keep only those
F5 = FOREACH F4 {
    DIS = DISTINCT F1;
    GENERATE FLATTEN(DIS);
};
F8 = JOIN F BY rank_C, F5 BY rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = CROSS R4, F5; -- cross used to generate the result when all the tuples are different
Z1 = FILTER Z BY R4::rank_C != F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
-- Z2: highest tuple when all three fields of the tuples are different
-- R6: all tuples when all three fields of the tuples are the same
-- F9: all tuples when two fields of the tuples are the same
res = UNION Z2, R6, F9;
DUMP res;
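
The selection rule itself (largest field 1, then largest field 2, and all the tuples when they tie on both) can be checked outside Pig. The following is a minimal Python sketch, not a translation of the script above: `top_tuples` is a hypothetical helper, and reading the rule as "keep every tuple whose first two fields form the largest pair in the bag" is an assumption about the intended tie handling.

```python
def top_tuples(bag):
    """Keep the tuples whose first two fields form the largest pair in the bag.

    Assumption: ties on both leading fields return every tied tuple,
    which covers the "return all the tuples" case.
    """
    if not bag:
        return []
    # Largest (field 1, field 2) pair; tuples compare field by field.
    best = max(t[:2] for t in bag)
    # Preserve the original order of the tuples that match it.
    return [t for t in bag if t[:2] == best]


# Bag from the question: field 1 ties, so field 2 decides.
print(top_tuples([(21, 25, 34), (21, 25, 52), (21, 28, 64)]))

# Both leading fields tie in every tuple, so all tuples are kept.
print(top_tuples([(21, 25, 34), (21, 25, 52)]))
```

Computing the maximum pair once and filtering against it avoids the separate branches for "all different", "all the same", and "two fields the same".
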
Vignesh I
  • Thanks for your help. I have one more requirement on top of it. For the above example, I need to find the tuple with the highest value on $0. If $0 is the same for all the tuples, then fetch the tuple with the highest value on $1. So for the databag {(21,25,34),(21,25,52),(21,28,64)} the output would be {(21,28,64)}. If the $0, $1 and $2 fields are all the same, then it should return all the tuples. – USY Oct 19 '15 at 17:19
  • Edited the answer. One thing to note: if all three fields are the same, you will get only one tuple. If the post resolves your query, accept the answer. – Vignesh I Oct 19 '15 at 17:40
  • Hi Vignesh, I am still looking for a way to get all the tuples if all three fields are the same. Can you please help me find it? – USY Oct 20 '15 at 05:14
  • Done. A little bit tricky. Updated the answer with comments. If all the tuples are the same it returns all of them, otherwise it returns the largest tuple. – Vignesh I Oct 20 '15 at 06:49
  • One last bit of help I need from you for the above dataset. I want to sort the data on the following criteria: 1. First sort on the $0 field in descending order for all the tuples and take the tuple with the largest $0 field. 2. If the $0 field is the same for all the tuples, then sort on the $1 field and take the tuple with the largest $1 field. 3. If the fields ($0, $1) are the same in all the tuples, then take all the tuples. – USY Oct 20 '15 at 08:15
  • That is what the latest answer achieves. – Vignesh I Oct 20 '15 at 08:22
  • Actually, your latest answer covers the case where all the fields in those tuples are the same. But if only the $0 and $1 fields are the same, we also need to fetch all the tuples. Please let me know if you need any more clarification. – USY Oct 20 '15 at 08:26
  • Updated the answer with your requirement. – Vignesh I Oct 20 '15 at 10:44
  • Hi Vignesh, I need some further help on this same topic. – USY Nov 18 '15 at 06:37