Multiple column combined unique in Pig

Question

I would like to perform a DISTINCT operation on a subset of the columns.

A = LOAD 'data' AS (a1,a2,a3,a4,a5,a6);
DUMP A;
(1, 2, 3, 4,5,5_1)
(1, 2, 3, 4,5,5_1)
(1, 2, 3, 4,6,6_1)
(1 ,2, 4, 4,7,7_1)
(1, 2, 4, 4,8,8_1)

-- insert DISTINCT operation on a1,a2,a3,a4 here:
-- ...
DUMP A_unique;
(1, 2, 3, 4,5,5_1)
(1, 2, 4, 4,7,7_1)

I have already looked at this question:

How to perform a DISTINCT in Pig Latin on a subset of columns?

and tried the following two approaches:

Method 1:

DATA = LOAD '/usr/local/Input.txt' AS (a1,a2,a3,a4,a5,a6);
DATA2 = FOREACH DATA GENERATE TOTUPLE(a1,a2,a3,a4) AS combined, a5 AS a5, a6 AS a6;
grouped_by_a5_a6 = GROUP DATA2 BY combined;
grouped_and_distinct = FOREACH grouped_by_a5_a6 {
    combined_unique = LIMIT DATA2 1;
    GENERATE FLATTEN(combined_unique);
};

Method 2:

DATA = LOAD '/usr/local/Input.txt' AS (a1,a2,a3,a4,a5,a6);
A2 = FOREACH DATA GENERATE TOTUPLE(a1,a2,a3,a4) AS combined, a5 AS a5, a6 AS a6;
grouped_by_a5_a6 = GROUP A2 BY (a5,a6);
grouped_and_distinct = FOREACH grouped_by_a5_a6 {
    combined_unique = DISTINCT A2.combined;
    GENERATE FLATTEN(combined_unique);
};

But I am getting output like this:

(1, 2, 3, 4,5,5_1)
(1, 2, 3, 4,6,6_1)
(1, 2, 4, 4,7,7_1)
(1, 2, 4, 4,8,8_1)

Instead of:

(1, 2, 3, 4,5,5_1)
(1, 2, 4, 4,7,7_1)

What is wrong with the code above?

Why do you think (1, 2, 3, 4,6,6_1) and (1, 2, 4, 4,8,8_1) should not be in the output? I don't see any reason why they would be filtered out, and it looks like the correct result to me. - bridiver, Apr 16 '14 at 12:43
@bridiver You are correct, but I need the output as I mentioned. Could you suggest the change in the code? - user2940111, Apr 16 '14 at 15:18
I'm not sure why you expect the other records to be filtered out. What is the basis for the filtering? - bridiver, Apr 16 '14 at 15:39

score 0 · Answer 1 · answered Apr 16 '14 at 17:41

What you are expecting is not the result of a distinct on those fields. To get the output you want you would have to apply a filter.
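
The answer does not spell out the filter, so the following is only an illustrative sketch and not the answerer's code. DISTINCT compares whole tuples, so rows that share a1 through a4 but differ in a5 or a6 remain distinct. If the goal is one row per (a1, a2, a3, a4) combination, one option is to group on those four columns and keep a single row per group. The sketch assumes the default tab-delimited loader and that "smallest a5 wins" is an acceptable tie-break; the relation names `by_key`, `ordered`, and `first_row` are made up for this example.

```pig
-- Sketch only: assumes default PigStorage (tab-delimited) input and that
-- keeping the row with the smallest a5 per key is the intended rule.
DATA = LOAD '/usr/local/Input.txt' AS (a1, a2, a3, a4, a5, a6);

-- Group on the four key columns only, not on a5/a6.
by_key = GROUP DATA BY (a1, a2, a3, a4);

-- Within each group, order by a5 and keep the first row.
A_unique = FOREACH by_key {
    ordered   = ORDER DATA BY a5;
    first_row = LIMIT ordered 1;
    GENERATE FLATTEN(first_row);
};

DUMP A_unique;
```

On the sample rows this would be expected to keep the a5 = 5 and a5 = 7 rows. Note that stray whitespace in the key fields (for example `1 ` versus `1`) would make otherwise equal keys fall into separate groups, so the input may need trimming first.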

answered Apr 16 '14 at 17:41

bridiver

