Why select distinct partitioned column is very slow?

Question

I hava a table zhihu_answer_increment, it was partitioned by column ym. When I execute query select distinct(ym) from zhihu.zhihu_answer_increment;, it took over 1 min to finish. During the process, hive launched a map-reduce job. here is the log:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.59 sec   HDFS Read: 14969 HDFS Write: 106 HDFS EC Read: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 590 msec
INFO  : Completed executing command(queryId=hive_20191015113300_a6f58bad-f35b-4243-890a-a0d9ba9a5210); Time taken: 95.048 seconds
INFO  : OK

In comparison, show partitions zhihu_answer_increment; return the result much faster (just few seconds). But I need to take select distinct(ym) from zhihu.zhihu_answer_increment as a sub-query.

So how can I optimise this query? and I don't understand why it launched a mapreduce job, as far as I'm concerned, only checking the partition directory is enough for this query. Or maybe my consideration is too simple.

score 3 · Answer 1 · answered Oct 15 '19 at 15:33

3

If you can use shell, then get partition list into variable using SHOW PARTITIONS (works fast) and transform it to the comma-separated list using AWK. Then parametrize your hive script using the variable with partition_list:

Something like this:

partition_list=$(hive -S -e "show partitions your_table;"  |  awk -vq="'" -F "=" 'f&&!NF{exit}{f=1}f{printf c q $2 q}{c=","}')


hive -e "select 1 from your_table where partition_column in (${partition_list}) limit 1"

answered Oct 15 '19 at 15:33

leftjoin

36,950
8
57
116

Thanks a lot for your answer. Is there any solution in SQL mode? In my scenario, it's not convenient to run in shell. – DennisLi Oct 16 '19 at 01:50
@DennisLi I faced w the same problem and could not find SQL solution. Using Hive 1.2 I tried to employ fetch-only task but it does not work with distinct unfortunately – leftjoin Oct 16 '19 at 04:09
Alright, I 'll take a try for that. Thanks @leftjoin. – DennisLi Oct 16 '19 at 05:41

Why select distinct partitioned column is very slow?

1 Answers1

Linked