I'm trying to do a LEFT OUTER JOIN on two of my tables in Hive. As I understand it, filter conditions should be placed alongside the join conditions in the ON clause, and omitted from the WHERE clause, to avoid full table scans. Reference: https://gist.github.com/randyzwitch/9abeb66d8637d1a0007c
Despite doing this, my query launches a large number of mappers and reducers, as if it were doing a full table scan.
Here is my query and its EXPLAIN plan. I'm not good at reading explain plans. m.date_id and d.REC_CREATED_DATE are the partition columns of the respective tables, so the query should scan only these partitions.
Any suggestions to improve my query would be of great help.
hive> EXPLAIN SELECT m.execution_id
> ,m.operation_name
> ,m.return_code
> ,m.explanation
> ,d.REC_CREATED_DATE
> FROM web_log_master m LEFT OUTER JOIN web_log_detail d
> on (m.execution_id = d.execution_id AND m.date_id='2015-07-14' and d.REC_CREATED_DATE='2015-07-14') ;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME web_log_master) m) (TOK_TABREF (TOK_TABNAME web_log_detail) d) (and (AND (= (. (TOK_TABLE_OR_COL m) execution_id) (. (TOK_TABLE_OR_COL d) execution_id)) (= (. (TOK_TABLE_OR_COL m) date_id) '2015-07-14')) (= (. (TOK_TABLE_OR_COL d) REC_CREATED_DATE) '2015-07-14')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) execution_id)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) operation_name)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) return_code)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) explanation)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL d) REC_CREATED_DATE)))))
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
d
TableScan
alias: d
Reduce Output Operator
key expressions:
expr: execution_id
type: string
sort order: +
Map-reduce partition columns:
expr: execution_id
type: string
tag: 1
value expressions:
expr: rec_created_date
type: string
m
TableScan
alias: m
Reduce Output Operator
key expressions:
expr: execution_id
type: string
sort order: +
Map-reduce partition columns:
expr: execution_id
type: string
tag: 0
value expressions:
expr: execution_id
type: string
expr: operation_name
type: string
expr: return_code
type: string
expr: explanation
type: string
expr: date_id
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3}
1 {VALUE._col3}
filter predicates:
0 {(VALUE._col13 = '2015-07-14')}
1
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2, _col3, _col19
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: string
expr: _col19
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 13.616 seconds, Fetched: 90 row(s)
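For comparison, here is an untested sketch of a variant that filters web_log_master before the join. The reasoning, as far as Hive's documented outer-join behavior goes, is that an ON-clause predicate on the preserved (left) table of a LEFT OUTER JOIN is not pushed down to the table scan; that would match the `filter predicates: 0 {(VALUE._col13 = '2015-07-14')}` entry in the plan above, where the date check on m happens in the reducer. The subquery is an assumption of this sketch, not part of the original query. It also changes the result: the original query returns every row of web_log_master (with NULL for d.REC_CREATED_DATE where the date does not match), whereas this variant returns only rows from the 2015-07-14 partition.

```sql
-- Sketch only: restrict web_log_master to one partition in a subquery,
-- so the partition filter applies to the scan rather than to the join.
SELECT m.execution_id,
       m.operation_name,
       m.return_code,
       m.explanation,
       d.REC_CREATED_DATE
FROM (
    SELECT execution_id, operation_name, return_code, explanation
    FROM web_log_master
    WHERE date_id = '2015-07-14'          -- partition column of web_log_master
) m
LEFT OUTER JOIN web_log_detail d
  ON (m.execution_id = d.execution_id
      AND d.REC_CREATED_DATE = '2015-07-14');  -- partition column of web_log_detail
```

Running EXPLAIN EXTENDED on both versions should list the input paths each one reads, which makes it possible to check whether only the intended partitions are scanned.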