Suppose my data looks like this with columns named food, action, and population:

pizzas   eatenBy  humans
pizzas   eatenBy  collegeKids
pizzas   eatenBy  everyOne
pizzas   grownBy  farmers
sprouts  grownBy  sproutFarmers
sprouts  grownBy  humans

How can I write a Pig Latin script that outputs exactly one row per unique (food, action) pair, with any one valid population taken from that pair's group?

That is, the only output I'd like from the above data would be this (though the population on the 1st and 3rd lines could be different):

pizzas   eatenBy  everyOne
pizzas   grownBy  farmers
sprouts  grownBy  sproutFarmers

Thank you,

user2250400

1 Answer

I don't know how you'd do this with DISTINCT (which would be more efficient than what I'm about to suggest), but you could group on (foodType, action) and keep one row from each group:

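-- Load the input; PigStorage's default field delimiter is a tab.
food = LOAD 'foodInput' AS (foodType, action, population);
-- One group per distinct (foodType, action) pair.
foodGrouped = GROUP food BY (foodType, action);
foodLimited = FOREACH foodGrouped {
    -- Keep a single, arbitrary row from each group.
    limited = LIMIT food 1;
    GENERATE FLATTEN(limited.(foodType, action, population));
};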
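
If any deterministic representative is acceptable, one possible alternative is to aggregate with the built-in MIN instead of a nested LIMIT. This is only a sketch: it assumes the fields are loaded as chararray and that picking the lexicographically smallest population is fine. The relation names foodTyped, byFoodAction, and onePerGroup are illustrative and not part of the original answer.

```pig
-- Assumption: tab-delimited input with string fields.
foodTyped = LOAD 'foodInput' AS (foodType:chararray, action:chararray, population:chararray);
byFoodAction = GROUP foodTyped BY (foodType, action);
-- MIN over a chararray column yields the lexicographically smallest value in each group.
onePerGroup = FOREACH byFoodAction GENERATE
    FLATTEN(group) AS (foodType, action),
    MIN(foodTyped.population) AS population;
DUMP onePerGroup;
```

Because MIN is an algebraic function, Pig can typically use the combiner here, which may help on large groups. Note that the population chosen per group can differ from the one the LIMIT version happens to return.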
DMulligan