
My code looks like this:

pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);

pymt_grp = GROUP pymt BY key;

results = FOREACH pymt_grp {
      /*
       *   some kind of logic: filter, count, distinct, sum, etc.
       */
};

But now I see many log entries like this:

org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 207012796 bytes from 1 objects. init = 5439488(5312K) used = 424200488(414258K) committed = 559284224(546176K) max = 559284224(546176K)

Actually, I found the cause: the main reason is that there is a "hot" key, something like key=0 used as an IP address, but I don't want to filter this key out. Is there any solution? I have already implemented the Algebraic and Accumulator interfaces in my UDF.

mark
  • That spilling seems to slow down the aggregation: if I filter out this hot key, the job takes about 5 minutes, but if it is not filtered, it takes more than 2 hours. – mark Aug 17 '12 at 03:13

1 Answer


I had similar issues with heavily skewed data or with DISTINCT nested in a FOREACH (as Pig will do an in-memory distinct). The solution was to take the DISTINCT out of the FOREACH; for an example, see my answer to How to optimize a group by statement in PIG latin?
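For instance, a common rewrite (the relation and field names below are hypothetical, just to illustrate the pattern) is to project and DISTINCT the data before the GROUP BY, so the de-duplication is not done on an in-memory bag inside the FOREACH:

pairs = foreach logs generate ip, user;
pairs_dist = distinct pairs;                -- de-duplicate before grouping
by_ip = group pairs_dist by ip;
unique_counts = foreach by_ip generate group as ip, COUNT(pairs_dist) as unique_users;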

If you do not want to do a DISTINCT before your SUM and COUNT, then I would suggest using two GROUP BYs. The first one groups on the Key column plus another column or a random number mod 100, which acts as a salt (to spread the data of a single key across multiple reducers). The second GROUP BY then groups on just the Key column and calculates the final SUM of the group-1 COUNTs or SUMs.

Ex:

inpt = load '/data.csv' using PigStorage(',') as (Key, Value);
view = foreach inpt generate Key, Value, ((int)(RANDOM() * 100)) as Salt;   -- add a random salt in 0..99

-- first pass: partial counts per (Key, Salt), so a hot key is spread over many reducers
group_1 = group view by (Key, Salt);
group_1_count = foreach group_1 generate group.Key as Key, COUNT(view) as count;

-- second pass: combine the partial counts per Key
group_2 = group group_1_count by Key;
final_count = foreach group_2 generate flatten(group) as Key, SUM(group_1_count.count) as count;
alexeipab
  • Thanks to alexeipab, the job is now done by following your suggestion. – mark Aug 20 '12 at 06:28
  • Interesting. I have the same SpillableMemoryManager log message; however, I do not have any DISTINCT clauses. I do have JOIN, GROUP, inner ORDER BY and inner LIMIT. – Marquez Sep 04 '13 at 13:46
  • GROUP will create a bag that is too big for the inner ORDER to sort in memory. Any inner operation (DISTINCT or ORDER) on bags that exceed, I think, 90 MB will trigger the spilling. In your case you can find the min in the salted groups and then the min of those mins: use GROUP BY (id, SALT) to find min/max; this will create more bags that are smaller and fit into RAM. Then flatten and GROUP BY (id) again. – alexeipab Sep 05 '13 at 16:16
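A minimal sketch of the salted min-of-mins approach described in the last comment (the relation and field names are hypothetical):

salted = foreach data generate id, val, ((int)(RANDOM() * 100)) as salt;
g1 = group salted by (id, salt);                                          -- smaller bags per (id, salt)
partial_min = foreach g1 generate group.id as id, MIN(salted.val) as min_val;
g2 = group partial_min by id;
final_min = foreach g2 generate group as id, MIN(partial_min.min_val) as min_val;   -- min of the partial mins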