Hierarchical Group By in Pig

Question

I have a dataset with a large number of fields & rows. I would like to perform a hierarchical group-by but can't seem to figure out how to access the fields in the grouped dataset.

For example, say we have (id, firstname, lastname, age, phone, city).

student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP group_1 by (group.age,group.phone);
group_3 = GROUP group_2 by (group.age);

These groups are being computed properly I am having trouble when I try to access the data, for example:

data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(group_1.student_details.city);

The last line causes an error Cannot find field city in student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}

Is it because student_details is a bag and I would need to run a for-each to access tuple inside the bag? Is there a straight-forward way to do this?

-- UPDATE --

Sample Data:

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi

Expected output would be exactly the same if we would have run the following code:

student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP student_details by (age,phone);
data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(student_details.city);
STORE data_1..
STORE data_2..

But I don't want to use student_details twice in Line 2 and 3.

This question talks about dropping tuples after a group-by. I do not want to drop any tuples, I want to do another group-by on a subset of the keys Using FLATTEN would mean that I loose the group-by which was performed in group_1.

Possible duplicate of [Projecting Grouped Tuples in Pig](http://stackoverflow.com/questions/10808202/projecting-grouped-tuples-in-pig) — Evan Carslake, Mar 22 '16 at 20:52
It talks about dropping tuples after a group-by. I do not want to drop any tuples, I want to do another group-by on a subset of the keys and then access the tuples. — Rachit Kyte.One, Mar 23 '16 at 06:53

score 1 · Accepted Answer · answered Mar 23 '16 at 10:56

1

For hierarchical GROUP BY you need CUBE operation. Here is an example that might get your problem:

student_details = LOAD 'data.csv' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
cubed = CUBE student_details BY ROLLUP(age,phone,id);
result = FOREACH cubed GENERATE FLATTEN(group) as (age,phone,id),  COUNT_STAR(cube) as CNT;
result = FILTER result BY age is not NULL and phone is not NULL;
DUMP result;

answered Mar 23 '16 at 10:56

alexeipab

3,609
14
16

Thanks for the help, but the cube operator would perform aggregates across all data & I can't do that in my case since the number of attributes is very large & so is the difference. For example, I might need `GROUP BY(a1,a2,a3,a4....a15)` and another `GROUP BY(a1,a2)`. So I feel it would be computationally very expensive to do a cube across all combinations of all 15 attributes. Please correct me if I'm wrong – Rachit Kyte.One Mar 23 '16 at 12:09
ROLLUP will limit the number of permutations, but these operations are implemented internally and are quite efficient, using secondary sort tricks and running sums which do not require duplication of data. In my expirience CUBE and ROLLUP never caused any performance issues. If you find that it is slow and you have very limited number of group by, you could try using http://datafu.incubator.apache.org/docs/datafu/1.3.0/datafu/pig/bags/BagGroup.html – alexeipab Mar 23 '16 at 12:46
Thank you for the answer, I guess that's the closest to solve my issue. – Rachit Kyte.One Mar 24 '16 at 11:56

Hierarchical Group By in Pig

1 Answers1