
I have 2 tables:

q6_cms_list_key1 (bucketed by cm and se, partitioned by tr_dt) ... 99 000 000 000 rows
q6_cm_first_visit (bucketed by cm and se) ... 25 000 000 000 rows

I am populating another table using the conditions below:

     insert into table q6_cm_first_visit1 PARTITION (trans) 
     select distinct 
            b.se10, b.dealer_id, b.terminal_id, b.se, 
            b.comp_start_n, b.comp_end_n, b.latest_date,
            b.cm, a.first_visit_date, b.trans 
       from q6_cm_first_visit a 
            inner join q6_cms_list_key1 b 
            on b.trans BETWEEN DATE_SUB('${hiveconf:run.date}', 180) AND '${hiveconf:run.date}' 
               and a.cm = b.cm and a.se = b.se;

The query is taking a long time. How can I reduce the run time, and can a non-partitioned/bucketed table be joined with a partitioned/bucketed table?

The formatted plan is below:

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-2 depends on stages: Stage-1
      Stage-3 depends on stages: Stage-2
      Stage-0 depends on stages: Stage-3
      Stage-4 depends on stages: Stage-0

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: a
                Statistics: Num rows: 25638813900 Data size: 10691743838084 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (cm is not null and se is not null) (type: boolean)
                  Statistics: Num rows: 25638813900 Data size: 10691743838084 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: cm (type: string), eff_se (type: bigint)
                    sort order: ++
                    Map-reduce partition columns: cm (type: string), se (type: bigint)
                    Statistics: Num rows: 25638813900 Data size: 10691743838084 Basic stats: COMPLETE Column stats: NONE
                    value expressions: first_visit_date (type: string)
              TableScan
                alias: b
                Statistics: Num rows: 91896551441 Data size: 52063175338060 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (cm is not null and se is not null and trans_dt BETWEEN null AND '${hiveconf:run.date}') (type: boolean)
                  Statistics: Num rows: 1 Data size: 566 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: cm (type: string), se (type: bigint)
                    sort order: ++
                    Map-reduce partition columns: cm (type: string), se (type: bigint)
                    Statistics: Num rows: 1 Data size: 566 Basic stats: COMPLETE Column stats: NONE
                    value expressions: se10 (type: string), dealer_id (type: string), terminal_id (type: string), comp_start_n (type: string), comp_end_n (type: string), latest_date (type: date), trans_dt (type: string)
          Reduce Operator Tree:
            Join Operator
              condition map:
                   Inner Join 0 to 1
              keys:
                0 cm (type: string), se (type: bigint)
                1 cm (type: string), se (type: bigint)
              outputColumnNames: _col5, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17
              Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col5 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: date), _col16 (type: string), _col17 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: date), _col16 (type: string), _col5 (type: string), _col17 (type: string)
                outputColumnNames: _col5, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col5, _col17
                Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  keys: _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: date), _col16 (type: string), _col5 (type: string), _col17 (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
                  Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: true
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

      Stage: Stage-2
        Map Reduce
          Map Operator Tree:
              TableScan
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: string), _col5 (type: string), _col6 (type: date), _col7 (type: string), _col8 (type: string), _col9 (type: string)
                  sort order: ++++++++++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: string), _col5 (type: string), _col6 (type: date), _col7 (type: string), _col8 (type: string), _col9 (type: string)
                  Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
          Reduce Operator Tree:
            Group By Operator
              keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: bigint), KEY._col4 (type: string), KEY._col5 (type: string), KEY._col6 (type: date), KEY._col7 (type: string), KEY._col8 (type: string), KEY._col9 (type: string)
              mode: mergepartial
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
              Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: true
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

      Stage: Stage-3
        Map Reduce
          Map Operator Tree:
              TableScan
                Reduce Output Operator
                  sort order:
                  Map-reduce partition columns: _col9 (type: string)
                  Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: string), _col5 (type: string), _col6 (type: date), _col7 (type: string), _col8 (type: string), _col9 (type: string)
          Reduce Operator Tree:
            Select Operator
              expressions: UDFToLong(VALUE._col0) (type: bigint), VALUE._col1 (type: string), VALUE._col2 (type: string), VALUE._col3 (type: bigint), VALUE._col4 (type: string), VALUE._col5 (type: string), VALUE._col6 (type: date), VALUE._col7 (type: string), VALUE._col8 (type: string), VALUE._col9 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
              Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
                table:
  • please provide EXPLAIN command output and also please provide more details about execution: which vertex is running slow if it is on Tez, or details about MR execution – leftjoin Dec 28 '20 at 12:49
  • Are you saying that you have 99 000 000 000 rows? Please don't use Indian words like "crore" – James Z Dec 28 '20 at 16:01

1 Answer


Some suggestions:

  1. Improve filtering. Calculate date_sub outside the script and pass the already-calculated date if possible (see the sketch below). Functions in predicates may prevent partition pruning. EXPLAIN DEPENDENCY provides more information about which partitions are being read; check that partition pruning works.
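
For example, a minimal sketch, assuming the job is launched from a bash wrapper and run.date is a yyyy-MM-dd string; the start.date variable and the first_visit.hql file name are introduced here only for illustration:

    # compute the window start outside Hive, so the join predicate
    # contains only literals and the partition filter can be pushed down
    run_date='2020-12-28'                                    # example value
    start_date=$(date -d "${run_date} -180 days" +%Y-%m-%d)  # GNU date
    hive --hiveconf run.date="${run_date}" \
         --hiveconf start.date="${start_date}" \
         -f first_visit.hql

    -- inside first_visit.hql the join condition becomes a pure literal range:
    -- on b.trans BETWEEN '${hiveconf:start.date}' AND '${hiveconf:run.date}'
    --    and a.cm = b.cm and a.se = b.se

Note that in your plan the filter shows trans_dt BETWEEN null AND '${hiveconf:run.date}', i.e. the DATE_SUB side could not be evaluated at compile time, which is exactly the situation pre-calculating the date avoids.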

  2. Improve the join strategy. Worth trying is the Sort Merge Bucket Join or Sort Merge Bucket Map Join. If both tables are bucketed and sorted by the same columns and have the same number of buckets, you can try it by adding these settings:

    set hive.auto.convert.sortmerge.join=true;
    set hive.optimize.bucketmapjoin = true;
    set hive.optimize.bucketmapjoin.sortedmerge = true;
    set hive.auto.convert.sortmerge.join.noconditionaltask=true;
    set hive.auto.convert.join.noconditionaltask.size = 10000000; --can be increased
    set hive.mapjoin.smalltable.filesize=10000000; --can be increased

The last 2 settings control what size of table can fit in memory. I am not sure whether it will work for tables like yours; setting too big a size may result in an OOM exception. But even without the map-join conversion, a Sort Merge Bucket Join should improve performance significantly. Check the plan after adding these settings: it should show a Sort Merge Bucket Join Operator instead of a Join Operator. See more details about all the settings here: LanguageManual JoinOptimization. A sketch of the table layout SMB join requires follows.
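
For reference, a minimal DDL sketch of what a Sort Merge Bucket Join expects; the bucket count (64) is an assumed value, but it must be identical for both tables:

    -- hypothetical sketch: both tables bucketed AND sorted on the join keys,
    -- with the same number of buckets
    create table q6_cm_first_visit
    (
      cm               string,
      se               bigint,
      first_visit_date string
      -- remaining columns omitted
    )
    clustered by (cm, se) sorted by (cm, se) into 64 buckets
    stored as orc;

    -- q6_cms_list_key1 needs the same clustered by / sorted by / bucket count,
    -- plus its partitioned by clause

If the tables are bucketed but not sorted, they have to be rebuilt (with hive.enforce.bucketing and hive.enforce.sorting enabled on older Hive versions) before the SMB join can kick in.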

  3. Vectorizing and Tez. You are running on MR and without vectorization. You will benefit from Tez and vectorization; try these settings:
    set hive.execution.engine=tez;
    SET hive.optimize.ppd=true;          --works for ORC
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;
  4. Final reducer improvement. The output table is partitioned, so check whether adding distribute by <partition key> at the end improves performance. If the partition key is evenly distributed without skew, distribute by will improve performance. Try adding DISTRIBUTE BY trans at the end of the query, as sketched below.
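
A minimal sketch of the same insert with the final distribution added; the subquery wrapper is used here only so that distinct and distribute by do not collide:

    insert into table q6_cm_first_visit1 PARTITION (trans)
    select t.*
      from ( select distinct
                    b.se10, b.dealer_id, b.terminal_id, b.se,
                    b.comp_start_n, b.comp_end_n, b.latest_date,
                    b.cm, a.first_visit_date, b.trans
               from q6_cm_first_visit a
                    inner join q6_cms_list_key1 b
                    on b.trans BETWEEN DATE_SUB('${hiveconf:run.date}', 180) AND '${hiveconf:run.date}'
                       and a.cm = b.cm and a.se = b.se ) t
    -- route all rows of one target partition to the same reducer;
    -- this helps only if trans is not heavily skewed
    distribute by t.trans;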

  5. Early aggregation before the join. For example, if the first_visit table contains more than one row per join key and multiplies rows after the join, de-duplicate it in a subquery before the join using row_number or distinct, etc. This may improve performance, but it may also prevent the sort-merge-bucket join. A sketch follows.
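
A minimal sketch of pre-join de-duplication with row_number; keeping the earliest first_visit_date per (cm, se) is an assumed rule here, so adjust the order by to whatever row should survive:

    select distinct
           b.se10, b.dealer_id, b.terminal_id, b.se,
           b.comp_start_n, b.comp_end_n, b.latest_date,
           b.cm, a.first_visit_date, b.trans
      from ( select cm, se, first_visit_date
               from ( select cm, se, first_visit_date,
                             -- keep one row per join key; assumed ordering rule
                             row_number() over (partition by cm, se
                                                order by first_visit_date) rn
                        from q6_cm_first_visit ) s
              where s.rn = 1 ) a
           inner join q6_cms_list_key1 b
           on b.trans BETWEEN DATE_SUB('${hiveconf:run.date}', 180) AND '${hiveconf:run.date}'
              and a.cm = b.cm and a.se = b.se;

If the de-duplicated side becomes unique per (cm, se), the outer distinct may do much less work or become unnecessary.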

It is better to try all these improvements separately to check how each of them affects performance, and finally combine those that improve it.

– leftjoin
  • thanks @leftjoin, below is the full query: insert into table q6_cm_first_visit1 PARTITION (trans) select distinct b.se10, b.dealer_id, b.terminal_id, b.se, b.comp_start_n, b.comp_end_n, b.latest_date, b.cm, a.first_visit_date, b.trans from q6_cm_first_visit a inner join q6_cms_list_key1 b on b.trans BETWEEN DATE_SUB('${hiveconf:run.date}', 180) AND '${hiveconf:run.date}' and a.cm = b.cm and a.se = b.se; q6_cms_list_key1 (bucketed by cm and se) partitioned by trans ... 99 000 000 000 rows q6_cm_first_visit (bucketed by cm and se) 25 000 000 000 rows – vashi Dec 28 '20 at 19:58