I'm using Pig Latin to filter some records.

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC 

The script should remove the duplicate records for each user and keep just one of them, something like the uniq command in Linux.

The output should be:

User1 8 NYC 
User2 4 NYC

Any suggestions?

SetFreeByTruth
aalsum

2 Answers

For your particular example, DISTINCT will not work well, because your output contains all of the input columns ($0, $1, $2) and the rows differ in $1. You can apply DISTINCT only to a projection that contains the columns ($0, $2) or ($0), which means losing $1.
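To illustrate the projection point above, here is a minimal sketch. The file name `users.tsv`, the tab-delimited layout, and the field names `user`, `score`, and `city` are assumptions made for the example; the question does not give a path or schema.

```pig
-- Hypothetical path and schema: tab-separated (user, score, city).
inpt = LOAD 'users.tsv' AS (user:chararray, score:int, city:chararray);

-- Project away the score column ($1), keeping only user and city.
user_city = FOREACH inpt GENERATE user, city;

-- DISTINCT now removes rows that are identical on (user, city).
uniq_user_city = DISTINCT user_city;

DUMP uniq_user_city;
```

On the sample data this collapses the two (User1, NYC) rows into one, but User1 still appears once per city, so it does not by itself yield one row per user.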

To select one record per user (any record), you could use a GROUP BY and a nested FOREACH with LIMIT. For example:

inpt = LOAD '......' ......;
-- Group all records by the user field ($0).
user_grp = GROUP inpt BY $0;
-- Keep one (arbitrary) record from each user's bag.
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};

This approach gives you records that are unique on a subset of fields, and it also lets you control how many records are output per user by changing the LIMIT value.
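LIMIT on its own does not promise which record of each group is kept. If a specific record is wanted, one option is to sort inside the nested block before limiting. The sketch below assumes a hypothetical tab-separated file `users.tsv` with fields named `user`, `score`, and `city`, and assumes the rule "keep the highest score per user"; neither the names nor the rule come from the question.

```pig
-- Hypothetical path and schema: tab-separated (user, score, city).
inpt = LOAD 'users.tsv' AS (user:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user;

top_per_user = FOREACH user_grp {
    -- Sort each user's bag so the kept record is well defined.
    sorted  = ORDER inpt BY score DESC;
    -- Raise this number to keep more records per user.
    top_rec = LIMIT sorted 1;
    GENERATE FLATTEN(top_rec);
};

DUMP top_per_user;
```

With this rule the sample data would keep the score 9 row for User1 rather than the score 8 row shown in the question, so pick the sort key and direction that match the record you actually want.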

Cihan Keser
alexeipab

Pig provides the DISTINCT operator to select unique records. If you want to apply DISTINCT to specific fields only, use it inside a nested FOREACH block.
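A nested DISTINCT might look like the sketch below. The file name `users.tsv` and the field names `user`, `score`, and `city` are hypothetical, chosen only for illustration.

```pig
-- Hypothetical path and schema: tab-separated (user, score, city).
inpt = LOAD 'users.tsv' AS (user:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user;

cities_per_user = FOREACH user_grp {
    -- Project one field out of the grouped bag, then de-duplicate it.
    cities      = inpt.city;
    uniq_cities = DISTINCT cities;
    GENERATE group AS user, FLATTEN(uniq_cities) AS city;
};

DUMP cities_per_user;
```

This returns each user's distinct cities and drops the score column, so it answers "which unique values does each user have" rather than "one full record per user".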

  • Be careful when using DISTINCT. The drawback of the DISTINCT keyword is that you cannot be sure that only the first record will be removed. – java_enthu Oct 10 '13 at 12:32