I have a huge input file on HDFS and I would like to use Pig to calculate several unique-count metrics. To make the problem easier to explain, assume the input file has the following schema:
userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray, dimensionC_key:chararray, activity:chararray, ...
Each record represents one activity performed by that userId.
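For example, a single record might look like this (the values here are made up purely for illustration):

(user_12345, a1, b7, c2, pageview, ...)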
Based on the value of the activity field, each activity record is mapped to one or more categories. There are about 10 categories in total.
Now I need to count the number of unique users for each dimension combination (A, B, C, A+B, A+C, B+C, A+B+C) and for each activity category.
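To make the expected output concrete: for the A+B combination and Category1, I want rows of the form below, and the analogous output for the other six combinations and the remaining categories (so roughly 7 x 10 = 70 unique-count metrics in total); the count column name is just a placeholder:

(dimensionA_key, dimensionB_key, unique_user_count_for_Category1)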
What would be the best practice for performing this kind of calculation?
I have tried several approaches. Although I can get the results I want, they take a very long time (i.e. days). I found that most of the time is spent in the map phase: the script seems to reload the huge input file every time it calculates one unique count. Is there a way to improve this behavior?
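To give an idea of what those attempts looked like, each unique count came from its own small pipeline roughly like the sketch below (simplified; the alias names are placeholders, and the filter reuses the category UDF described further down), so the full input is re-read for every single metric:

source = load ... as (userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray, dimensionC_key:chararray, activity:chararray, ...);
cat1 = filter source by udf.newUserIdForCategory1(userId, activity) is not null;
ab = group cat1 by (dimensionA_key, dimensionB_key);
counts = foreach ab {
    -- distinct user ids within each (A, B) group, then count them
    uniq = distinct cat1.userId;
    generate FLATTEN(group), COUNT(uniq);
}
store counts ...;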
I also tried something similar to the script below, but it looks like it hits the memory limit of a single reducer and gets stuck at the last reduce step.
source = load ... as (userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray, dimensionC_key:chararray, activity:chararray, ...);
a = group source by (dimensionA_key, dimensionB_key);
b = foreach a {
    -- each udf returns the original userId if the activity maps to that category, and null otherwise
    -- (nested foreach statements require Pig 0.10+)
    userIds1 = foreach source generate udf.newUserIdForCategory1(userId, activity) as uid;
    userIds2 = foreach source generate udf.newUserIdForCategory2(userId, activity) as uid;
    userIds3 = foreach source generate udf.newUserIdForCategory3(userId, activity) as uid;
    ...
    userIds10 = foreach source generate udf.newUserIdForCategory10(userId, activity) as uid;
    -- take the distinct user ids per category; COUNT skips nulls, so only real users are counted
    uniq1 = distinct userIds1;
    uniq2 = distinct userIds2;
    uniq3 = distinct userIds3;
    ...
    uniq10 = distinct userIds10;
    generate FLATTEN(group), COUNT(uniq1), COUNT(uniq2), COUNT(uniq3), ..., COUNT(uniq10);
}
store b ...;
Thanks. T.E.