Say I have a job with 500 mappers in total, of which 100 run in parallel at any time.
The input size for each mapper is almost the same, so the processing time per mapper should be more or less identical.
But the first 100 mappers finish in about 20 minutes, the next 100 take 25-30 minutes each, and the batch of 100 after that takes around 40-50 minutes each. Later on, we get a GC overhead error.
Why is this happening?
I have the following configuration already set:
<property><name>mapred.child.java.opts</name><value>-Xmx4096m</value></property>
<property><name>mapred.job.reuse.jvm.num.tasks</name><value>1</value></property>
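To narrow this down, one thing that could be tried is turning on GC logging for the child JVMs, so the task logs show whether collection time grows from one batch to the next. This is only a sketch and not part of my current setup: it assumes a HotSpot JVM on Java 8 or earlier, where these logging flags are available, and it keeps the same 4 GB heap as above.

```xml
<property>
  <name>mapred.child.java.opts</name>
  <!-- Assumption: HotSpot JVM, Java 8 or earlier. Heap size unchanged; the extra flags only add GC log output. -->
  <value>-Xmx4096m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
</property>
```

Comparing the GC lines of a fast early mapper with those of a slow later one should show whether the slowdown is really heap pressure or something else, such as contention on the nodes.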
What else can be done here?