1

I follow a help How to handle spill memory in pig from alexeipab, it really works fine, but I have another question now, same sample code:

pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);

pymt_grp_with_salt = GROUP pymt BY (key,salt)

results_with_salt = FOREACH pymt_grp {
    --distinct
    mid_set = FILTER pymt BY xxx=='abc';
    mid_set_result = DISTINCT mid_set.yyy;
    result = COUNT(mid_set_result)   
}

pymt_grp = GROUP results_with_salt BY key;

result = FOREACH pymt_grp {

   GENERATE SUM(results_with_salt.result); --it is WRONG!!
}

I can't use sum in that group, which it will be very different from result that calculated without salt.

is there any solution? if filter first, it will cost many JOIN job, and slow down the performance.

Community
  • 1
  • 1
mark
  • 292
  • 1
  • 6
  • 18

2 Answers2

1

For this to work, you need to have many to one relationship between mid_set.yyy and salt, so that same value for mid_set.yyy from different rows is mapped into the same value of salt. If it is not, than that value of mid_set.yyy will appear in different bags produced by GROUP pymt BY (key, salt), survive DISTINCT in different salts, thus are included multiple times in the final rollup. That is why you can get wrong results when using salts and COUNT of DISTINCT.

An easy way could be to replace salt with mid_set.yyy itself or to write a UDF/static method which calculates salt by taking hash of mid_set.yyy and does mod N, where N could be 1 to infinity, for best distribution N should be a prime number.

alexeipab
  • 3,609
  • 14
  • 16
0

Thanks alexeipab, you give me a great help, what i do as below

pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);

pymt = FOREACH pymt GENERATE *, (yyy%$prime_num) as salt;

pymt_grp_with_salt = GROUP pymt BY (key,salt);

It works!!

if yyy is num integer, you can use hash to convert string or others to a integer

mark
  • 292
  • 1
  • 6
  • 18