
I have a problem that I am not sure how to solve in Pig. I have a dataset on Hadoop (approx. 4 million records) that contains product titles by product category. Each title has the number of times it was shown on the web page and the number of times it was clicked through to a product details page. The number of titles within a product category can vary.

Sample Data -

Video Games|Halo 4|5400|25
Video Games|Forza Motorsport 4 Limited Collector's Edition|6000|10
Video Games|Marvel Ultimate Alliance|2000|55
Cameras & Photo|Pro Steadicam for GoPro HD|12000|250
Cameras & Photo|Hero GoPro Motorsports 1080P Wide HD 5MP Helmet Camera|10000|125

I want to get the top N% of records within each product category, based on the third column (appearances on the web page). However, N has to vary with the weight/importance of the category: for Video Games I want the top 15% of records, for Cameras & Photo the top 5%, and so on. Is there a way to dynamically set the percentage or integer value in the LIMIT clause within a nested FOREACH block in Pig?

PRODUCT_DATA = LOAD '<PRODUCT FILE PATH>' USING PigStorage('|') AS (categ_name:chararray, product_titl:chararray, impression_cnt:long, click_through_cnt:long);

GRP_PROD_DATA = GROUP PRODUCT_DATA BY categ_name;

TOP_PROD_LIST = FOREACH GRP_PROD_DATA {

                  SORTED_TOP_PROD = ORDER PRODUCT_DATA BY impression_cnt DESC;
                  SAMPLED_DATA = LIMIT SORTED_TOP_PROD <CATEGORY % OR INTEGER VALUE>;
                  GENERATE FLATTEN(SAMPLED_DATA);
                }

STORE TOP_PROD_LIST INTO '<SOME PATH>' USING PigStorage('|');

How can I dynamically (by category) set the percentage or integer value for a given group? I thought of using a macro, but macros cannot be called from within a nested FOREACH block. Can I write a UDF that takes the category name as a parameter and returns the percentage or integer value, and call that UDF from a LIMIT operation, like this?

SAMPLED_DATA = LIMIT SORTED_TOP_PROD categLimitVal(categ_name);

Any suggestions? I am using version 0.10 of Pig.

axm2064
  • You can use Hadoop streaming and replace the `LIMIT` line with `STREAM SORTED_TOP_PROD THROUGH \`awk ...\`;`. However, I suggest you rethink your use case: to get the top % of records you need to count them first (or keep them all in memory in the awk script) and then take the top % according to a map of category to percentage. This is sort of anti-MapReduce. Taking the top X values is fast to process; the top Y% is not. – cabad Oct 03 '13 at 21:39

2 Answers


Something like this may work. However, I've never had the need to look up variable keys in a Pig map, and this other SO question doesn't have an answer, so you'll need to do some trial and error to make it work:

--Load your dynamic percentages as a map
A = LOAD 'percentages' AS (categ_name:chararray, perc:float);
PERCENTAGES = FOREACH A GENERATE TOMAP(categ_name, perc);

PRODUCT_DATA = LOAD ...;
GRP_PROD_DATA = GROUP PRODUCT_DATA BY categ_name;

--Count the elements per group; needed to calculate percentages
C = FOREACH GRP_PROD_DATA GENERATE FLATTEN(group) AS categ_name, COUNT(PRODUCT_DATA) AS count;
C_MAP = FOREACH C GENERATE TOMAP(categ_name, count);

TOP_PROD_LIST = FOREACH GRP_PROD_DATA {
    SORTED_TOP_PROD = ORDER PRODUCT_DATA BY impression_cnt DESC;
    SAMPLED_DATA = LIMIT SORTED_TOP_PROD (C_MAP#group * PERCENTAGES#group);
    GENERATE FLATTEN(SAMPLED_DATA);
}

You could also try using Pig's TOP function instead of ORDER + LIMIT.
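For reference, a minimal sketch of the TOP route with a fixed cutoff (the 100 below is an assumed placeholder, not a value from the question; whether TOP accepts a per-group expression for its first argument in 0.10 would need to be verified):

TOP_PROD_LIST = FOREACH GRP_PROD_DATA {
    -- top 100 titles per category by impression_cnt (field index 2 in the bag)
    result = TOP(100, 2, PRODUCT_DATA);
    GENERATE FLATTEN(result);
}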

cabad
  • Thanks for answering my question, cabad. Unfortunately I tried the above approach and it didn't work. LIMIT does not allow anything other than an integer or a scalar expression; it does not allow another column in the relation to be referenced. It looks like neither LIMIT nor TOP will work in this case. Not sure what else I can try. – axm2064 Oct 04 '13 at 07:30
  • Pig allows [casting relations to scalars](https://issues.apache.org/jira/browse/PIG-1434), so doing that may fix your problem. Alternatively, the `awk` solution that I suggested in my comment to your question would definitely work. – cabad Oct 04 '13 at 13:53
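For reference, a minimal sketch of the relation-to-scalar cast mentioned in the comment above (aliases are illustrative; it yields a single global value rather than a per-category one, so it would still have to be combined with per-category logic):

--Single-row relation whose field can be referenced as a scalar (PIG-1434)
ALL_DATA = GROUP PRODUCT_DATA ALL;
TOTAL = FOREACH ALL_DATA GENERATE COUNT(PRODUCT_DATA) AS cnt;
--TOTAL.cnt can then be used as a scalar in expressions of other statements
WITH_TOTAL = FOREACH PRODUCT_DATA GENERATE categ_name, impression_cnt, TOTAL.cnt AS total_cnt;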

I think I solved it using a slightly different approach. I am not sure how well optimized it is; maybe there is a better way to organize or optimize the script. Basically, if I rank the product titles within each category in descending order of impression count and keep only the rows whose rank is <= the sample limit for the category, I can simulate the dynamic sampling. The sample limit is simply the count of titles per category multiplied by the percentage weight defined for that category. To rank the tuples, I am using LinkedIn's open-source DataFu jar, which provides an Enumerate UDF.

Again, if anyone has suggestions for improving or better organizing the code, I am all ears :) Thanks for your input, cabad, it really helped!

Script:

REGISTER '/tmp/udf/datafu-1.0.0.jar';
-- Enumerate('1') appends a 1-based index to each tuple of a bag
define Enumerate datafu.pig.bags.Enumerate('1');
set default_parallel 10;

LKP_DATA = LOAD '/tmp/lkp.dat' USING PigStorage('|') AS (categ_name:chararray, perc:float);
PRODUCT_DATA = LOAD '/tmp/meta.dat' USING PigStorage('|') AS (categ_name:chararray, product_titl:chararray, impression_cnt:long, click_through_cnt:long);

GRP_PROD_DATA = GROUP PRODUCT_DATA BY categ_name;

-- Count the number of titles in each category
CATEG_COUNT = FOREACH GRP_PROD_DATA GENERATE FLATTEN(group) AS categ_name, COUNT(PRODUCT_DATA) AS cnt;

-- Map-side (replicated) join of the per-category counts with the small percentage lookup
CATEG_JOINED = JOIN CATEG_COUNT BY categ_name, LKP_DATA BY categ_name USING 'replicated';

CATEG_PERCENT = FOREACH CATEG_JOINED GENERATE CATEG_COUNT::categ_name AS categ_name, CATEG_COUNT::cnt AS record_cnt, LKP_DATA::perc AS percentage;

PRCNT_PROD_DATA = JOIN PRODUCT_DATA BY categ_name, CATEG_PERCENT BY categ_name;

-- Attach the category's sample size (record count * percentage weight) to every product row
PRCNT_PROD_DATA = FOREACH PRCNT_PROD_DATA GENERATE PRODUCT_DATA::categ_name AS categ_name, PRODUCT_DATA::product_titl AS product_titl, PRODUCT_DATA::impression_cnt AS impression_cnt, PRODUCT_DATA::click_through_cnt AS click_through_cnt, CATEG_PERCENT::record_cnt*CATEG_PERCENT::percentage AS sample_size;

GRP_PRCNT_PROD_DATA = GROUP PRCNT_PROD_DATA BY categ_name;

-- Sort each category's titles by impression count, highest first
ORDRD_PROD_LIST = FOREACH GRP_PRCNT_PROD_DATA {
                     SORTED_TOP_PROD = ORDER PRCNT_PROD_DATA BY impression_cnt DESC;
                     GENERATE FLATTEN(SORTED_TOP_PROD);
                  }

GRP_PROD_LIST = GROUP ORDRD_PROD_LIST BY categ_name;

-- Enumerate appends a rank (rnk) to each tuple of the sorted per-category bag
RANKED_PROD_DATA = FOREACH GRP_PROD_LIST GENERATE FLATTEN(Enumerate(ORDRD_PROD_LIST)) AS (categ_name, product_titl, impression_cnt, click_through_cnt, sample_size, rnk);

-- Keep only the rows whose rank falls within the category's sample size
SAMPLED_DATA = FILTER RANKED_PROD_DATA BY rnk <= sample_size;

SAMPLED_DATA = FOREACH SAMPLED_DATA GENERATE categ_name, product_titl, impression_cnt, click_through_cnt, rnk;

DUMP SAMPLED_DATA;
axm2064
  • Quick comment: If you switch to Pig 0.11, you can use the `RANK` operator instead of the Enumerate UDF. – cabad Oct 07 '13 at 14:03
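For reference, a minimal sketch of the RANK operator the comment mentions (Pig 0.11+; the alias is taken from the script above). Note that RANK numbers tuples across the whole relation rather than per group, so the per-category grouping and filtering steps would still be needed:

-- Prepends a rank column, ordered by impression_cnt descending (Pig 0.11+)
RANKED = RANK PRCNT_PROD_DATA BY impression_cnt DESC;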