
My table (MyTable, ~365 GB) contains 2 years of customer-behavior data. It is partitioned by day and clustered by customer_id into 64 buckets. On average, one day contains 8 million entries.

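For reference, the table DDL looks roughly like this (only the columns used below are shown; the ORC storage format is an assumption based on vectorization being enabled later, and I assume customer_id and cid refer to the same column):

    CREATE TABLE MyTable (
        cid           BIGINT,  -- customer id, the bucketing key
        NUM_PURCHASES INT      -- purchases per customer and day
        -- further behavior columns omitted
    )
    PARTITIONED BY (dt STRING)  -- one partition per day, e.g. '20181001'
    CLUSTERED BY (cid) INTO 64 BUCKETS
    STORED AS ORC;
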
My task is to retrieve the customers for a given day (~512 MB) and look back into their behavior - e.g. the number of purchases in the last 2 years.

As I understand it, a left semi join is applicable here, such as:

    WITH TabA AS (SELECT cid, NUM_PURCHASES FROM MyTable WHERE dt >= '20161001' AND dt <= '20181001'),
         TabB AS (SELECT cid FROM MyTable WHERE dt = '20181001')
    SELECT TabA.cid AS ID,
           SUM(TabA.NUM_PURCHASES) AS total_p
    FROM TabA LEFT SEMI JOIN TabB ON (TabB.cid = TabA.cid)
    GROUP BY TabA.cid;

As my table is bucketed, I relied heavily on the join optimization advice posted in Hive join optimization. So the following parameters are set in Hive (note that Tez does not work in my environment):

    set hive.auto.convert.join=true;
    set hive.variable.substitute.depth=150;
    set hive.optimize.skewjoin.compiletime=true;
    set hive.optimize.skewjoin=true;
    set hive.enforce.bucketing=true;
    set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;
    set hive.exec.parallel=true;
    set hive.vectorized.execution.enabled=true;
    set hive.vectorized.execution.reduce.enabled=true;
    set hive.vectorized.execution.reduce.groupby.enabled=true;
    set hive.cbo.enable=true;
    set mapred.child.java.opts=-Xmx4G -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit;
    set mapreduce.map.memory.mb=9216;
    set mapreduce.reduce.memory.mb=9216;

The last three lines were added because I had memory issues.

My query fails on the first job. The mappers run to 100%, and once the reducers (seem to) start, the job gets reset and fails again. The cluster manager reports Java heap space errors. I have also tried reducing the memory per mapper (6 GB, 4 GB) and per reducer (8 GB, 7 GB, 6 GB), in all combinations, but I get the same error.
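I also wonder whether my heap settings match the container sizes: `mapred.child.java.opts` caps the heap at 4 GB while the containers are 9216 MB. If the usual guideline of heap at roughly 80% of container size applies, a consistent combination would be something like:

    set mapreduce.map.memory.mb=6144;
    set mapreduce.map.java.opts=-Xmx4915m;
    set mapreduce.reduce.memory.mb=8192;
    set mapreduce.reduce.java.opts=-Xmx6554m;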

Could someone give me insight into a) how to make this work, b) how much memory I should allocate per mapper/reducer, and c) whether my query can be optimized (i.e. such that the GROUP BY on cid is done before the left semi join)?
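For c), what I have in mind is roughly the following (a sketch; it should be equivalent, since the semi join only filters on the group key cid):

    WITH TabA AS (SELECT cid, SUM(NUM_PURCHASES) AS total_p
                  FROM MyTable
                  WHERE dt >= '20161001' AND dt <= '20181001'
                  GROUP BY cid),
         TabB AS (SELECT cid FROM MyTable WHERE dt = '20181001')
    SELECT TabA.cid AS ID, TabA.total_p
    FROM TabA LEFT SEMI JOIN TabB ON (TabB.cid = TabA.cid);

The join input would then be at most one row per customer instead of up to ~730 daily rows.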

Alex
  • Check your `yarn-site.xml` file, in particular the value of `yarn.nodemanager.resource.memory-mb`. You cannot assign more memory than this value. Another thing: I would recommend not using subqueries for large data; it's better to create some temporary staging tables. – serge_k Jan 17 '19 at 06:53
  • Thank you @serge_k. I have not checked yarn-site.xml - instead I created a temporary table with the 2-year data (grouped by cid), and I am able to complete the task above with `set mapreduce.map.memory.mb=4096; set mapreduce.reduce.memory.mb=8192;` – Alex Jan 18 '19 at 16:56
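For reference, a sketch of the staging-table approach described in that comment (the table name tmp_purchases_2y is made up):

    CREATE TABLE tmp_purchases_2y AS
    SELECT cid, SUM(NUM_PURCHASES) AS total_p
    FROM MyTable
    WHERE dt >= '20161001' AND dt <= '20181001'
    GROUP BY cid;

    -- then filter to the customers seen on the target day
    SELECT t.cid AS ID, t.total_p
    FROM tmp_purchases_2y t
    LEFT SEMI JOIN (SELECT cid FROM MyTable WHERE dt = '20181001') d
      ON (d.cid = t.cid);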

0 Answers