I have a problem that I am not sure how to solve in Pig. I have a dataset on Hadoop (approx. 4 million records) that contains product titles by product category. Each title has the number of times it appeared on the web page (impressions) and the number of times it was clicked through to a product details page. The number of titles within a product category varies.
Sample data (category|title|impressions|clicks):
Video Games|Halo 4|5400|25
Video Games|Forza Motorsport 4 Limited Collector's Edition|6000|10
Video Games|Marvel Ultimate Alliance|2000|55
Cameras & Photo|Pro Steadicam for GoPro HD|12000|250
Cameras & Photo|Hero GoPro Motorsports 1080P Wide HD 5MP Helmet Camera|10000|125
I want to get the top N % of records within each product category, ranked by the 3rd column (impressions on the web page). However, N has to vary with the weight/importance of the category. E.g., for Video Games I want the top 15 % of records; for Cameras & Photo I want the top 5 %, etc. Is there a way to dynamically set the % or integer value in the LIMIT clause within a nested FOREACH block in Pig?
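To make the expected result concrete, here is a minimal sketch in plain Python (not Pig) of the selection I am after. The TOP_PCT table, the function name, and the round-up rule (so that every category keeps at least one row) are assumptions for illustration only; the percentages are just the two examples above.

```python
from collections import defaultdict

# Hypothetical per-category percentages, taken from the examples in the question.
TOP_PCT = {"Video Games": 15, "Cameras & Photo": 5}

# The sample records: (categ_name, product_titl, impression_cnt, click_through_cnt)
SAMPLE = [
    ("Video Games", "Halo 4", 5400, 25),
    ("Video Games", "Forza Motorsport 4 Limited Collector's Edition", 6000, 10),
    ("Video Games", "Marvel Ultimate Alliance", 2000, 55),
    ("Cameras & Photo", "Pro Steadicam for GoPro HD", 12000, 250),
    ("Cameras & Photo", "Hero GoPro Motorsports 1080P Wide HD 5MP Helmet Camera", 10000, 125),
]

def top_pct_by_category(rows, top_pct):
    """Return the top N % of rows per category, ranked by impression count."""
    by_categ = defaultdict(list)
    for row in rows:
        by_categ[row[0]].append(row)

    result = []
    for categ, items in by_categ.items():
        # Highest impression counts first.
        items.sort(key=lambda r: r[2], reverse=True)
        # Assumption: round up, so a small category still keeps one row.
        keep = (len(items) * top_pct[categ] + 99) // 100
        result.extend(items[:keep])
    return result

for row in top_pct_by_category(SAMPLE, TOP_PCT):
    print("|".join(str(field) for field in row))
```

With so few sample rows, each category keeps only its single most-shown title; on the real data the cut-off would be 15 % or 5 % of the category's row count.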
PRODUCT_DATA = LOAD '<PRODUCT FILE PATH>' USING PigStorage('|') AS (categ_name:chararray, product_titl:chararray, impression_cnt:long, click_through_cnt:long);
GRP_PROD_DATA = GROUP PRODUCT_DATA BY categ_name;
TOP_PROD_LIST = FOREACH GRP_PROD_DATA {
    SORTED_TOP_PROD = ORDER PRODUCT_DATA BY impression_cnt DESC;
    SAMPLED_DATA = LIMIT SORTED_TOP_PROD <CATEGORY % OR INTEGER VALUE>;
    GENERATE FLATTEN(SAMPLED_DATA);
}
STORE TOP_PROD_LIST INTO '<SOME PATH>' USING PigStorage('|');
How can I dynamically (by category) set the % or integer value for the given group? I thought of using a macro, but macros cannot be called from within a nested FOREACH block. Can I write a UDF that takes the category name as a parameter and returns the % or integer value, and then call this UDF from the LIMIT operation?
SAMPLED_DATA = LIMIT SORTED_TOP_PROD categLimitVal(categ_name);
Any suggestions? I am using version 0.10 of Pig.