We have a Pig join between a small table (16M rows, distinct keys) and a big, skewed table (6B rows). A regular join finishes in 2 hours (after some tweaking). We tried a skewed join (USING 'skewed') and were able to improve the performance to 20 minutes.
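For reference, the two variants look roughly like this (a minimal sketch; the relation names, key names, and schemas are placeholders, not our actual script):

    -- minimal sketch; relation names, key names, and schemas are placeholders
    small = LOAD 'small_table' AS (k:chararray, v:chararray);  -- ~16M rows, distinct keys
    big   = LOAD 'big_table'   AS (k:chararray, v:chararray);  -- ~6B rows, heavily skewed keys

    -- regular join: ~2 hours after tweaking
    j1 = JOIN big BY k, small BY k;

    -- skewed join: ~20 minutes on the 6B-row table; Pig first runs a
    -- SAMPLER job to estimate the key distribution of the skewed input
    j2 = JOIN big BY k, small BY k USING 'skewed';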
However, when we try a bigger skewed table (19B rows), the SAMPLER job fails with this message:
    Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
        at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
        at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]
This is reproducible every time we use the skewed join, and it does not happen when we use the regular join.
We tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1, and we can see the property in the job.xml file, but it doesn't change anything!
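In case it matters, this is how we set the property; Pig's SET statement forwards it into the job configuration, which matches what we see in job.xml (again a sketch, with placeholder names carried over from above):

    -- set the property for this script's jobs, then run the skewed join
    set mapreduce.jobtracker.split.metainfo.maxsize '-1';
    j2 = JOIN big BY k, small BY k USING 'skewed';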
What's happening here? Is this a bug in the distribution sample created by the skewed join? Why doesn't changing the parameter to -1 help?