
We have a Pig join between a small (16M rows) distinct table and a big (6B rows) skewed table. A regular join finishes in 2 hours (after some tweaking). We tried using skewed and were able to improve the performance to 20 minutes.
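
For reference, a minimal sketch of the two variants (relation and field names are illustrative, not our actual schema):

big   = LOAD 'big_data'   AS (b1, b2);
small = LOAD 'small_data' AS (s1, s2);
-- regular join: skew in the join key piles work onto a few reducers
R = JOIN big BY b1, small BY s1;
-- skewed join: Pig first runs a SAMPLER job to estimate the key distribution
S = JOIN big BY b1, small BY s1 USING 'skewed';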

HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job:

Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]

This is reproducible every time we try using skewed, and does not happen when we use the regular join.

We tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1 and we can see it's there in the job.xml file, but it doesn't change anything!
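
For reference, a property like this is typically set either inside the Pig script or on the command line; a sketch of both (the exact way it was passed here may have differed):

-- in the Pig script:
SET mapreduce.jobtracker.split.metainfo.maxsize '-1';
-- or on the pig command line (before other options):
-- pig -Dmapreduce.jobtracker.split.metainfo.maxsize=-1 myscript.pig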

What's happening here? Is this a bug with the distribution sample created by using skewed? Why doesn't changing the param to -1 help?

ihadanny
  • decided to file a jira bug: https://issues.apache.org/jira/browse/PIG-3411 , will update – ihadanny Aug 06 '13 at 06:31
  • we found that changing mapreduce.jobtracker.split.metainfo.maxsize is known not to work in the job level, only in the jobTracker level, see here: https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/UWBMKplvGkg – ihadanny Aug 06 '13 at 13:49
  • did you ever find a solution to this problem? We're facing a similar issue. – KennethJ May 31 '16 at 04:01
  • @KennethJ, I don't think so, and the bug still seems to be open. – ihadanny May 31 '16 at 13:03

2 Answers


A small table of 1 MB is small enough to fit into memory, so try a replicated join. A replicated join is map-only: it does not trigger a reduce stage like the other join types, and is therefore immune to skew in the join keys. It should be quick.

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';

The big table must always be listed first in the JOIN statement.

UPDATE 1: If the small table in its original form does not fit into memory, then as a workaround you would need to partition the small table into pieces that are small enough to fit into memory, and then apply the same partitioning to the big table. Ideally you could add the same partitioning algorithm to the system that creates the big table, so that you do not waste time repartitioning it. After partitioning you can use the replicated join, but it will require running the Pig script separately for each partition.
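
A rough sketch of that per-partition approach, assuming integer join keys, 10 hash buckets, and a $part parameter passed per run (all names and the bucket count are illustrative):

-- run once per bucket, e.g.: pig -param part=0 partitioned_join.pig
big   = LOAD 'big_data'   AS (b1:int, b2, b3);
small = LOAD 'small_data' AS (s1:int, s2, s3);
-- keep only the keys that hash into the current bucket
small_part = FILTER small BY (s1 % 10) == $part;
big_part   = FILTER big   BY (b1 % 10) == $part;
-- each bucket of the small table now fits in memory, so a replicated join works
C = JOIN big_part BY b1, small_part BY s1 USING 'replicated';
STORE C INTO 'joined_part_$part';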

alexeipab
  • nice idea, but the small table isn't 1MB (edited question) and won't fit in the hadoop cache (tried it) – ihadanny Jun 18 '13 at 09:34
  • Updated the answer. See Update 1. – alexeipab Jun 18 '13 at 09:48
  • Thanks again, but I'm looking for an explanation for the original problem. This is a cool workaround but I'm not going for it till I understand what's wrong with the conventional join – ihadanny Jun 19 '13 at 07:31
  • In this case I am afraid you would need to download the source code for the Pig version you have, search for "Split metadata size exceeded" and analyse the code. – alexeipab Jun 20 '13 at 16:37

In newer versions of Hadoop (>= 2.4.0, but possibly earlier) you should be able to set the maximum split metadata size at the job level by using the following configuration property:

mapreduce.job.split.metainfo.maxsize=-1
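
If you are on one of those versions, a minimal sketch of setting it from within the Pig script (the value is passed as a string; it could equally be passed on the command line with -D):

-- illustrative: lift the split metadata size limit for this job
SET mapreduce.job.split.metainfo.maxsize '-1';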

Mass Dosage