I have a huge input file on HDFS and I would like to use Pig to calculate several unique-count metrics. To make the problem easier to explain, assume the input file has the following schema:
userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray, dimensionC_key:chararray, activity:chararray, ...
Each record represents one activity performed by that userId.
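For example, a single record might look like this (the values here are made up purely for illustration):

(user_12345, a1, b7, c2, pageview, ...)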
Based on the value of the activity field, each activity record is mapped to one or more categories. There are about 10 categories in total.
Now I need to count the number of unique users for each dimension combination (A, B, C, A+B, A+C, B+C, A+B+C) and for each activity category.
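To make the expected output concrete: for the A+B combination and Category1, I want rows of the form below, and the analogous output for the other six combinations and the remaining categories (so roughly 7 x 10 = 70 unique-count metrics in total); the count column name is just a placeholder:

(dimensionA_key, dimensionB_key, unique_user_count_for_Category1)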
What would be the best practice for performing this kind of calculation?
I have tried several approaches. Although I can get the results I want, they take a very long time (i.e. days). I found that most of the time is spent in the map phase: the script seems to reload the huge input file every time it calculates one unique count. Is there a way to improve this behavior?
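To give an idea of what those attempts looked like, each unique count came from its own small pipeline roughly like the sketch below (simplified; the alias names are placeholders, and the filter reuses the category UDF described further down), so the full input is re-read for every single metric:

source = load ... as (userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray, dimensionC_key:chararray, activity:chararray, ...);
cat1 = filter source by udf.newUserIdForCategory1(userId, activity) is not null;
ab = group cat1 by (dimensionA_key, dimensionB_key);
counts = foreach ab {
    -- distinct user ids within each (A, B) group, then count them
    uniq = distinct cat1.userId;
    generate FLATTEN(group), COUNT(uniq);
}
store counts ...;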
I also tried something similar to the script below, but it looks like it hits the memory limit of a single reducer and gets stuck at the last reduce step.
source = load ... as (userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray, dimensionC_key:chararray, activity:chararray, ...);
a = group source by (dimensionA_key, dimensionB_key);
b = foreach a {
    -- each udf returns the original userId if the activity maps to that category, and null otherwise
    -- (nested foreach statements require Pig 0.10+)
    userIds1 = foreach source generate udf.newUserIdForCategory1(userId, activity) as uid;
    userIds2 = foreach source generate udf.newUserIdForCategory2(userId, activity) as uid;
    userIds3 = foreach source generate udf.newUserIdForCategory3(userId, activity) as uid;
    ...
    userIds10 = foreach source generate udf.newUserIdForCategory10(userId, activity) as uid;
    -- take the distinct user ids per category; COUNT skips nulls, so only real users are counted
    uniq1 = distinct userIds1;
    uniq2 = distinct userIds2;
    uniq3 = distinct userIds3;
    ...
    uniq10 = distinct userIds10;
    generate FLATTEN(group), COUNT(uniq1), COUNT(uniq2), COUNT(uniq3), ..., COUNT(uniq10);
}
store b ...;
Thanks. T.E.