I followed alexeipab's answer to "How to handle spill memory in pig", and it works well, but now I have another question. Using the same sample code:
pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);
pymt_grp_with_salt = GROUP pymt BY (key, salt);
results_with_salt = FOREACH pymt_grp_with_salt {
--distinct
mid_set = FILTER pymt BY xxx=='abc';
mid_set_result = DISTINCT mid_set.yyy;
GENERATE group.key AS key, COUNT(mid_set_result) AS result;
}
pymt_grp = GROUP results_with_salt BY key;
result = FOREACH pymt_grp {
GENERATE SUM(results_with_salt.result); --it is WRONG!!
}
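I can't use SUM over that group: the total can be very different from the result calculated without salt. As far as I can tell, this is because the same yyy value can fall under several salts of the same key and is then counted once per salt.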
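To make the discrepancy concrete, here is a small Python sketch (a simulation, not Pig) that compares the distinct count per key with the sum of per-(key, salt) distinct counts. The rows and the helper names are hypothetical; every row is assumed to have already passed the xxx == 'abc' filter, and the salt is assumed to be assigned independently of yyy.

```python
from collections import defaultdict

# Hypothetical rows: (key, salt, yyy), assumed to have passed the filter.
rows = [
    ("k1", 0, "a"),
    ("k1", 0, "b"),
    ("k1", 1, "a"),  # "a" appears again under a different salt
    ("k1", 1, "c"),
]


def distinct_count_without_salt(rows):
    """Distinct yyy values per key, ignoring the salt."""
    seen = defaultdict(set)
    for key, _salt, yyy in rows:
        seen[key].add(yyy)
    return {key: len(values) for key, values in seen.items()}


def summed_salted_distinct_counts(rows):
    """Distinct yyy per (key, salt), then summed per key."""
    per_salt = defaultdict(set)
    for key, salt, yyy in rows:
        per_salt[(key, salt)].add(yyy)
    totals = defaultdict(int)
    for (key, _salt), values in per_salt.items():
        totals[key] += len(values)
    return dict(totals)


print(distinct_count_without_salt(rows))    # {'k1': 3}
print(summed_salted_distinct_counts(rows))  # {'k1': 4}
```

The value "a" is counted once under salt 0 and once under salt 1, so the summed result is 4 while the true distinct count is 3.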
Is there any solution? If I filter first, it will require many JOIN jobs and slow down performance.