I would like to perform a DISTINCT operation on a subset of the columns.
A = LOAD 'data' AS(a1,a2,a3,a4,a5,a6);
DUMP A;
(1, 2, 3, 4,5,5_1)
(1, 2, 3, 4,5,5_1)
(1, 2, 3, 4,6,6_1)
(1 ,2, 4, 4,7,7_1)
(1, 2, 4, 4,8,8_1)
-- insert DISTINCT operation on a1,a2,a3,a4 here:
-- ...
DUMP A_unique;
(1, 2, 3, 4,5,5_1)
(1, 2, 4, 4,7,7_1)
I have already read this question:
How to perform a DISTINCT in Pig Latin on a subset of columns?
and tried the following two approaches:
Method 1:
DATA = LOAD '/usr/local/Input.txt' AS (a1,a2,a3,a4,a5,a6);
DATA2 = FOREACH DATA GENERATE TOTUPLE(a1,a2,a3,a4) AS combined, a5 AS a5, a6 AS a6;
grouped_by_a5_a6 = GROUP DATA2 BY combined;
grouped_and_distinct = FOREACH grouped_by_a5_a6 {
    combined_unique = LIMIT DATA2 1;
    GENERATE FLATTEN(combined_unique);
};
Method 2:
DATA = LOAD '/usr/local/Input.txt' AS (a1,a2,a3,a4,a5,a6);
A2 = FOREACH DATA GENERATE TOTUPLE(a1,a2,a3,a4) AS combined, a5 AS a5, a6 AS a6;
grouped_by_a5_a6 = GROUP A2 BY (a5,a6);
grouped_and_distinct = FOREACH grouped_by_a5_a6 {
    combined_unique = DISTINCT A2.combined;
    GENERATE FLATTEN(combined_unique);
};
But I am getting this output:
(1, 2, 3, 4,5,5_1)
(1, 2, 3, 4,6,6_1)
(1, 2, 4, 4,7,7_1)
(1, 2, 4, 4,8,8_1)
Instead of:
(1, 2, 3, 4,5,5_1)
(1, 2, 4, 4,7,7_1)
What is wrong with the code above?
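Two things may be worth checking here, offered as a sketch rather than a confirmed diagnosis. First, Method 2 groups by (a5,a6), so each group holds rows sharing a5 and a6, not rows sharing (a1,a2,a3,a4). Second, PigStorage splits on tabs by default, so if Input.txt is comma-delimited (an assumption, the post does not say), each whole line may be landing in a1. The sketch below assumes a comma-delimited file with no stray spaces around the values; the aliases grouped and first_row are made up for illustration.

```pig
-- Assumption: the input is comma-delimited; PigStorage defaults to tab.
DATA = LOAD '/usr/local/Input.txt' USING PigStorage(',') AS (a1,a2,a3,a4,a5,a6);

-- Group on the four key columns only, not on (a5,a6).
grouped = GROUP DATA BY (a1,a2,a3,a4);

-- Keep one full row per key; which row LIMIT keeps is not guaranteed.
A_unique = FOREACH grouped {
    first_row = LIMIT DATA 1;
    GENERATE FLATTEN(first_row);
};
DUMP A_unique;
```

If a particular row per key is needed (for example the one with the smallest a5), adding an ORDER on the bag inside the nested block before the LIMIT should make the choice deterministic.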