
Is it possible to get the following output in Pig? Can I GROUP BY the 1st and 2nd fields and then apply DISTINCT to the 3rd field?

For example, I have this input data:

    12345|9658965|52145
    12345|9658965|52145
    12345|9658965|52145
    23456|8541232|96589
    23456|8541232|96585



I want output something like:

    12345|9658965|52145
    23456|8541232|96589
    23456|8541232|96585
pd123

2 Answers


Approach 1: Using DISTINCT

Ref: http://pig.apache.org/docs/r0.12.0/basic.html#distinct

The DISTINCT operator should help, since it removes duplicate tuples from the relation:

test = LOAD 'test.csv' USING PigStorage('|');
distinct_recs = DISTINCT test;
DUMP distinct_recs;

Approach 2: GROUP BY all fields

test = LOAD 'test.csv' USING PigStorage('|');
grp_all_fields = GROUP test BY ($0,$1,$2);
uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
DUMP uniq_recs;

Both approaches give the expected output for the input shared.

Murali Rao
  • Hi, I tried Pig's DISTINCT. It also removes non-distinct records. It gives only one instance of 23456|8541232|96585 instead of two. – pd123 Mar 02 '17 at 01:17
  • Did it help? If yes, you can accept this as the answer; otherwise, share the issue you faced. – Murali Rao Mar 02 '17 at 01:19
  • @pd123: Can you share the code you tried that returns only one record? For the input shared, the code I posted gives me the expected output. – Murali Rao Mar 02 '17 at 04:04
  • @pd123: Updated the code with one more approach to achieve the objective; give it a try. In my test run I see the expected output with both approaches. – Murali Rao Mar 02 '17 at 18:20

Try this; it's pretty similar:

A = LOAD 'test.csv' USING PigStorage('|') AS (a1, a2, a3);
unique = FOREACH (GROUP A BY a3) {
    b = A.(a1, a2);
    s = DISTINCT b;
    GENERATE FLATTEN(s), group AS a4;
};
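
A variant that follows the question's wording more literally would group on the first two fields and apply DISTINCT to the third inside a nested FOREACH. This is only a minimal sketch: it assumes the same pipe-delimited test.csv, and the names a1, a2, a3, grouped, thirds, and uniq are illustrative, not from the original post.

```pig
-- Assumption: same pipe-delimited input as above; field names are illustrative.
A = LOAD 'test.csv' USING PigStorage('|') AS (a1, a2, a3);

-- One group per (a1, a2) pair.
grouped = GROUP A BY (a1, a2);

uniq = FOREACH grouped {
    thirds = A.a3;          -- bag of third-field values for this key
    d = DISTINCT thirds;    -- drop repeated third-field values
    -- Flatten the key and the distinct values back into flat rows.
    GENERATE FLATTEN(group), FLATTEN(d);
};

DUMP uniq;
```

For the sample input this should leave one row for the 12345 key and two rows for the 23456 key, though Pig does not guarantee the order in which they are printed.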
San