
We're doing a simple Pig join between a small table and a big, skewed table. We cannot use "using 'skewed'" due to another bug (a Pig skewed join with a big table causes "Split metadata size exceeded 10000000") :(
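For reference, the join looks roughly like this (a sketch with hypothetical aliases and paths):

small = LOAD '/path/to/small_table' AS (key:chararray, val:chararray);
big   = LOAD '/path/to/big_skewed_table' AS (key:chararray, payload:chararray);

joined = JOIN big BY key, small BY key;   -- the plain join we actually run
-- joined = JOIN big BY key, small BY key USING 'skewed';   -- fails with "Split metadata size exceeded 10000000"

STORE joined INTO '/path/to/output';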

If we use the default mapred.job.shuffle.input.buffer.percent=0.70, some of our reducers fail in the shuffle stage:

org.apache.hadoop.mapred.Task: attempt_201305151351_21567_r_000236_0 : 
Map output copy failure : java.lang.OutOfMemoryError: GC overhead limit exceeded

If we change it to mapred.job.shuffle.input.buffer.percent=0.30 it finishes nicely, although it takes 2 hours (there are 3 lagging reducers out of the 1000 reducers we use), and we can see something like this in the lagging reducers' logs:

SpillableMemoryManager: first memory handler call- 
Usage threshold init = 715849728(699072K) used = 504241680(492423K) committed = 715849728(699072K) max = 715849728(699072K)

Why does this happen? How come the SpillableMemoryManager doesn't protect us from failing when the shuffle input buffer is at 70%?
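For reference, this is roughly how we override the property per run (a sketch; assuming Pig's set command, which forwards Hadoop job properties to the job):

set mapred.job.shuffle.input.buffer.percent 0.30;   -- default is 0.70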

ihadanny

1 Answer


Generally speaking, mapred.job.shuffle.input.buffer.percent=0.70 will not trigger an OutOfMemoryError, because this configuration ensures that at most 70% of the reducer's heap is used to store shuffled data. However, in my experience there are two scenarios that may cause an OutOfMemoryError.
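For example, with a reducer heap of -Xmx1000m (an illustrative number, not necessarily your setting), at most roughly 700 MB of copied map output would be buffered in memory before segments are merged out to disk.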

1) Your program has a combine() function and your combine() is memory-consuming, so the memory usage may exceed 70% of the heap in the shuffle phase, which may cause an OutOfMemoryError. In general, though, Pig does not use combine() in its Join operator.

2) The JVM manages its memory itself and divides the heap into Eden, S0, S1 and old space. S0 and S1 are used for GC. In some cases, S0 + S1 + buffered shuffle data (70% of the heap) > heap size, so an OutOfMemoryError occurs.
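To make that concrete with purely hypothetical numbers: on a 1000 MB reducer heap, the shuffle buffer alone can hold up to ~700 MB of live data; if the survivor spaces and other live objects need a few hundred MB more during a collection, the GC cannot reclaim enough memory and eventually fails with "GC overhead limit exceeded", which is exactly the error you see above.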

As you mentioned, when mapred.job.shuffle.input.buffer.percent=0.30, only 30% of the heap is used for storing shuffled data, so the heap is unlikely to fill up. I would need the job's detailed configuration (such as -Xmx), the data size, and the logs to give you a more specific answer.

Speaking of the SpillableMemoryManager: the default collection data structure in Pig is a "Bag". Bags are spillable, meaning that if there is not enough memory to hold all the tuples of a bag in RAM, Pig will spill part of the bag to disk. This allows a large job to make progress, albeit slowly, rather than crashing with "out of memory" errors. (This paragraph is from Pig's blog.)
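For example (a toy sketch with hypothetical aliases), a GROUP builds one bag of tuples per key on the reduce side; if a hot key produces a bag that does not fit in RAM, the SpillableMemoryManager spills part of that bag to disk instead of failing:

grouped = GROUP big BY key;                             -- one (possibly huge) bag per key
counts  = FOREACH grouped GENERATE group, COUNT(big);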

However, the shuffle phase is controlled by Hadoop itself, so the SpillableMemoryManager does not take effect in the shuffle phase (strictly speaking, it can take effect in combine(), which is used in Group By, but Join does not have combine()). The SpillableMemoryManager is normally used in the map(), combine() and reduce() functions. This is why the SpillableMemoryManager doesn't protect you from failing when the shuffle input buffer is at 70%. Note that Hadoop does not hold all the shuffled data in memory; it will merge partial shuffled data onto disk if it grows too large.
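If you want to experiment with when that merge-to-disk happens, these are the knobs I would start with (a sketch only; the names below are the Hadoop 1.x / MRv1 property names, and the values shown are the defaults, not recommendations):

set mapred.job.shuffle.input.buffer.percent 0.70;   -- fraction of the reducer heap used to buffer shuffled map output
set mapred.job.shuffle.merge.percent 0.66;          -- buffer fill level that triggers an in-memory merge to disk
set mapred.inmem.merge.threshold 1000;              -- number of in-memory map-output segments that triggers a merge
set mapred.job.reduce.input.buffer.percent 0.0;     -- fraction of heap allowed to retain map outputs while reduce() runs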

Lijie Xu
  • Thanks for the very thorough explanation. So basically you're saying that SpillableMemoryManager has nothing to do with my problem as it's a pig feature, and my problem lies in the hadoop level beneath it, which is using its own RAM manager, right? Can you guess what's taking up so much memory in S0 and S1? If the shuffle stage uses a RAM manager, I would expect that this would be the only big element in the memory, and no other memory would be taken by big objects... – ihadanny Aug 14 '13 at 14:48
  • Yes, SpillableMemoryManager has nothing to do with your problem. It's a bit hard to explain clearly the relationship among S0, S1, Eden and Old space. When GC occurs, some alive objects in Eden are copied into S0/S1 or Old. So S0/S1/Old needs some free space to hold these objects. That's why JVM needs more memory than you think. The answer to the final question is normally yes. But if you change some code in ReduceTask.java, this assumption may break. – Lijie Xu Aug 15 '13 at 02:25
  • I've updated the answer above. Additional big objects may exist in shuffle phase if your combine() is memory-consuming. – Lijie Xu Aug 15 '13 at 07:15
  • I can't reproduce the error now because our cluster config has changed (yay!) but I'm pretty sure that the OOM happened in the ReduceTask and not in the MapTask, so it can't be a combine() problem (combine runs in the mapper) – ihadanny Aug 15 '13 at 11:42
  • It doesn't matter. To be exact, combine() can also happen in the shuffle phase in the ReduceTask. If you take a look at the "counters" in the reducers, you may find that "Combine input records" and "Combine output records" are not zero in some jobs. It means combine() has run. – Lijie Xu Aug 15 '13 at 12:38