I have a dataset with a large number of fields & rows. I would like to perform a hierarchical group-by but can't seem to figure out how to access the fields in the grouped dataset.
For example, say we have (id, firstname, lastname, age, phone, city).
student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP group_1 by (group.age,group.phone);
group_3 = GROUP group_2 by (group.age);
These groups are computed properly, but I am having trouble when I try to access the data. For example:
data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(group_1.student_details.city);
The last line causes this error:
Cannot find field city in student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}
Is it because student_details is a bag, so I would need to run a FOREACH to access the tuples inside it? Is there a straightforward way to do this?
-- UPDATE --
Sample Data:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi
009,ABC,DEF,111,9834534343,Delhi
The expected output is exactly what the following code would produce:
student_details = LOAD 'student_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray);
group_1 = GROUP student_details by (age,phone,id);
group_2 = GROUP student_details by (age,phone);
data_1 = FOREACH group_1 GENERATE group.age,group.phone,group.id,COUNT(student_details.city);
data_2 = FOREACH group_2 GENERATE group.age,group.phone,COUNT(student_details.city);
STORE data_1..
STORE data_2..
But I don't want to group student_details twice, as lines 2 and 3 of that code do.
This question talks about dropping tuples after a group-by. I do not want to drop any tuples; I want to do another group-by on a subset of the keys. Using FLATTEN would mean that I lose the group-by that was performed in group_1.
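One possible way to avoid grouping student_details a second time is to roll up the first-level counts instead of reaching into the nested bags. This is a minimal sketch, not a confirmed solution. It assumes the error arises because group_1.student_details is a bag of bags, so city cannot be projected from it directly. It also relies on counts being additive, so it would not carry over to aggregates that cannot be combined this way. The alias city_count is a name introduced here for illustration, and group_2 is built from data_1 rather than from group_1 as in the code above.

```pig
-- Sketch only: reuse the level-1 counts instead of re-grouping student_details.
student_details = LOAD 'student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

-- Level 1: one row per (age, phone, id) with its city count.
group_1 = GROUP student_details BY (age, phone, id);
data_1 = FOREACH group_1 GENERATE
    group.age AS age,
    group.phone AS phone,
    group.id AS id,
    COUNT(student_details.city) AS city_count;  -- city_count: assumed alias

-- Level 2: group the already-aggregated rows on a subset of the keys
-- and add up their counts. Note that group_2 is built from data_1 here.
group_2 = GROUP data_1 BY (age, phone);
data_2 = FOREACH group_2 GENERATE
    group.age AS age,
    group.phone AS phone,
    SUM(data_1.city_count) AS city_count;
```

With the sample data, the three identical rows for id 009 should yield a count of 3 in data_1 and again in data_2, since they share the same age and phone.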