Here is my hive query
Insert into schemaB.employee partition(year)
select * from schemaA.employee;
Below is the Query Execution plan produced by this query.
hive> explain <query>;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: employee
Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: Col1 (type: binary), col2 (type: binary), col3 (type: array<string>), year (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col3 (type: int)
sort order: +
Map-reduce partition columns: _col3 (type: int)
Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: binary), _col1 (type: binary), _col2 (type: array<string>), _col3 (type: int)
Reduce Operator Tree:
Extract
Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
name: schemaB.employee
Stage: Stage-0
Move Operator
tables:
partition:
year
replace: false
table:
input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
name: schemaB.employee
Stage: Stage-2
Stats-Aggr Operator
I have two questions related to the query execution plan:
- Why is there a reduce step in the query plan? In my understanding, all it needs to do is copy the data from one HDFS location to another, which can be achieved by mappers alone. Is the reduce step has something to do with the partitions present in the table?
- What is the Stats-Aggr Operator step present in Stage 2? I am not able to find a relevant documentation explaining this.