
I am new to Apache Pig. I want to split and flatten the following input into the output below, which lists every user who viewed each product.

My input (UserId, ProductIds):

12345   123456,23456,987653  
23456   23456,123456,234567  
34567   234567,765678,987653

My required output (ProductId, UserId):

123456  12345  
123456  23456  
23456   12345    
23456   23456  
987653  12345  
987653  34567  
234567  23456  
234567  34567  
765678  34567

My Pig script:

 a = load '/home/hadoopuser/ips' using PigStorage('\t') as (key:chararray, val:chararray);  
 b = foreach a generate key as ky1, FLATTEN(TOKENIZE(val)) as vl1;  
 c = group b by vl1;  
 d = foreach c generate group as vl2, $1 as ky2;  
 e = foreach d generate vl2, BagToString(ky2) as kyy;  
 f = foreach e generate vl2 as vl3,FLATTEN(STRSPLIT(kyy,'_')) as ky3;  
 g = foreach f generate vl3, FLATTEN(TOKENIZE(ky3)) as kk1; 
 dump g; 

I got the following output, which drops the repeated product values and keeps only one user per product:

(23456,12345)  
(123456,12345)  
(234567,23456)  
(765678,34567)  
(987653,12345)  

I don't know how to solve this. Can anyone help, and is there a simpler way to do it?

f_puras
Karthick S

1 Answer


The second line of your code already does what you want; it just outputs the user first and the product second. Put the FLATTEN first and the key second:

a = load '/home/hadoopuser/ips' using PigStorage('\t') as (key:chararray, val:chararray);  
b = foreach a generate FLATTEN(TOKENIZE(val)) as ProductId, key as UserId;
dump b; 

(123456,12345)
(23456,12345)
(987653,12345)
(23456,23456)
(123456,23456)
(234567,23456)
(234567,34567)
(765678,34567)
(987653,34567)

As for why your current code yields only one result per ProductId: you group by ProductId, which gives one row per distinct ProductId with a bag containing all the customers who viewed that product. You then convert that bag into one long string joined by _, only to split it apart again:

d = foreach c generate group as vl2, $1 as ky2;  
e = foreach d generate vl2, BagToString(ky2) as kyy;  
f = foreach e generate vl2 as vl3,FLATTEN(STRSPLIT(kyy,'_')) as ky3;  

The BagToString UDF joins the values in the bag into a single string, separated by a delimiter that defaults to _. In the next line you split that string on _ with STRSPLIT, which returns a tuple rather than a bag. Flattening a tuple expands it into fields of the same row (flattening a bag would have produced one row per element), so instead of a row with the ProductId and a bag, you get a single row whose first field is the ProductId and whose remaining fields are the user and product values that were in the bag:

Before FLATTEN:

(23456,{(23456,23456),(12345,23456)})
(123456,{(23456,123456),(12345,123456)})
(234567,{(34567,234567),(23456,234567)})
(765678,{(34567,765678)})
(987653,{(34567,987653),(12345,987653)})

After FLATTEN:

(23456,23456,23456,12345,23456)
(123456,23456,123456,12345,123456)
(234567,34567,234567,23456,234567)
(765678,34567,765678)
(987653,34567,987653,12345,987653)

This is where the error lies. You have a single row per product, with the customers spread across several fields of that row. The last foreach keeps only the first field (the product) and the second (the first customer), discarding the remaining customers in each row.
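
If you do want to keep the group step (for example, to count viewers per product first), a minimal sketch is to flatten the grouped bag directly instead of round-tripping it through a string. This reuses the aliases a, b and c from the question; the field names ProductId and UserId are only illustrative:

```pig
a = load '/home/hadoopuser/ips' using PigStorage('\t') as (key:chararray, val:chararray);
-- One (ky1, vl1) row per user/product pair.
b = foreach a generate key as ky1, FLATTEN(TOKENIZE(val)) as vl1;
-- One row per product, holding a bag of the (ky1, vl1) tuples for that product.
c = group b by vl1;
-- Flattening a bag yields one output row per tuple in it, so every user is kept.
d = foreach c generate group as ProductId, FLATTEN(b.ky1) as UserId;
dump d;
```

Because FLATTEN is applied to a bag here rather than to a tuple, each user ends up in a row of their own next to the product, rather than as an extra field in a single row.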

Balduz
  • Hi Balduz, Thanks for your reply. It's working properly and I understood the problem very well by your clear explanation. – Karthick S Jul 28 '15 at 12:46