For your particular example, DISTINCT will not work well because your output contains all of the input columns ($0, $1, $2). You can apply DISTINCT only to a projection that has columns ($0, $2) or ($0), which means losing $1.
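To make the limitation concrete, here is a minimal sketch of DISTINCT on a projection. The load path and delimiter are hypothetical placeholders, not taken from the question; the point is only that DISTINCT compares whole tuples, so $1 has to be dropped before deduplicating.

```pig
-- Hypothetical input: path and delimiter are assumptions for illustration.
inpt = LOAD 'input_path' USING PigStorage('\t');

-- Project only the columns that should define uniqueness; $1 is dropped here.
proj = FOREACH inpt GENERATE $0, $2;

-- DISTINCT works on entire tuples, so duplicates are judged on ($0, $2) only.
uniq = DISTINCT proj;

DUMP uniq;
```

The result has one row per unique ($0, $2) pair, but there is no way to recover $1 from it.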
To select one record per user (any record), you could use a GROUP BY and a nested FOREACH with LIMIT. For example:
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
    top_rec = LIMIT inpt 1;
    GENERATE FLATTEN(top_rec);
};
This approach gives you records that are unique on a subset of fields, and it also lets you control how many records are output per user by changing the LIMIT value.
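If you need a specific record per user rather than an arbitrary one, one possible variation is to add a nested ORDER before the LIMIT. The sketch below assumes a hypothetical schema (user, f1, f2) and a hypothetical load path, neither of which comes from the question, and keeps up to 3 records per user with the largest f2.

```pig
-- Hypothetical path and schema, used only for illustration.
inpt = LOAD 'input_path' AS (user:chararray, f1:chararray, f2:long);

user_grp = GROUP inpt BY user;

top_n = FOREACH user_grp {
    -- Sort each user's bag so that LIMIT picks a predictable subset.
    sorted = ORDER inpt BY f2 DESC;
    -- Change 3 to the number of records you want to keep per user.
    top_recs = LIMIT sorted 3;
    GENERATE FLATTEN(top_recs);
};
```

Without the ORDER step, which record LIMIT keeps within each group is not guaranteed, which is fine when any record will do.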