I have a Pig script running on an EMR cluster (emr-5.4.0) that uses a custom UDF. The UDF looks up dimensional data, for which it imports a (somewhat) large amount of text data.
In the Pig script, the UDF is used as follows:
DEFINE LookupInteger com.ourcompany.LookupInteger(<some parameters>);
The UDF stores its lookup data in a Map<Integer, Integer>.
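For illustration, the loading code follows roughly this pattern (a minimal sketch under assumptions: a tab-separated key/value file and a plain HashMap; the file format, names, and error handling are hypothetical, not the actual LocalFileUtil implementation):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LocalFileUtil {
    // Hypothetical sketch of the loading pattern; the real implementation differs.
    public static Map<Integer, Integer> toMap(String path) throws IOException {
        Map<Integer, Integer> map = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // String.split allocates a fresh array and substrings per line,
                // which is where the GC overhead limit is hit in the trace below.
                String[] parts = line.split("\t");
                map.put(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
            }
        }
        return map;
    }
}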
On some input data, the aggregation fails with the following exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.split(String.java:2377)
at java.lang.String.split(String.java:2422)
[...]
at com.ourcompany.LocalFileUtil.toMap(LocalFileUtil.java:71)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:46)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:19)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:379)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.genericGetNext(POBinCond.java:76)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNextInteger(POBinCond.java:118)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
This does not occur when the Pig aggregation is run with MapReduce, so our workaround is to replace pig -x tez with pig -x mapreduce.
As I'm new to Amazon EMR and to Pig on Tez, I'd appreciate some hints on how to analyse or debug this issue.
EDIT: It looks like strange runtime behaviour while running the Pig script on the Tez stack.
Please note that the Pig script uses
- replicated joins (the smaller relations to be joined need to fit into memory; see the sketch after this list), and
- the already mentioned UDF, which initialises a Map<Integer, Integer> and thereby produces the aforementioned OutOfMemoryError.
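For reference, the replicated joins follow roughly this pattern (a minimal sketch with hypothetical relation and field names; dimension stands for the small relation that gets loaded into memory on every task):

-- hypothetical relation names; the right-hand (small) relation is replicated
joined = JOIN facts BY dim_id, dimension BY id USING 'replicated';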