I hava a table zhihu_answer_increment
, it was partitioned by column ym
. When I execute query select distinct(ym) from zhihu.zhihu_answer_increment;
, it took over 1 min to finish. During the process, hive launched a map-reduce job. here is the log:
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.59 sec HDFS Read: 14969 HDFS Write: 106 HDFS EC Read: 0 SUCCESS
INFO : Total MapReduce CPU Time Spent: 3 seconds 590 msec
INFO : Completed executing command(queryId=hive_20191015113300_a6f58bad-f35b-4243-890a-a0d9ba9a5210); Time taken: 95.048 seconds
INFO : OK
In comparison, show partitions zhihu_answer_increment;
return the result much faster (just few seconds). But I need to take select distinct(ym) from zhihu.zhihu_answer_increment
as a sub-query.
So how can I optimise this query? and I don't understand why it launched a mapreduce job, as far as I'm concerned, only checking the partition directory is enough for this query. Or maybe my consideration is too simple.