Hive Distinct Query taking more time

Asked Oct 01 '19 at 17:05

Active Oct 01 '19 at 17:05

Viewed 245 times

I have partitioned table, Table Structure

create table tab1 
(
col1 int,
col2 string,
...
col50 int,
col51 int
)
partitioned by 
(col50 int, col51 int)
stored as orc;

Currently we have ~17000 partitions and each partition will have minimum of ~50k records.

Below Query is taking more time ~ 90Mins

SELECT DISTINCT col2 FROM tab1
select col2 from (select col2, row_number() over (partition by col2 order by col3) as rnk from tab1) t1 where t1.rnk=1

Is there a way we can reduce the execution time, Thanks in advance

asked Oct 01 '19 at 17:05

Vijiy

You have shown two queries. – Gordon Linoff Oct 01 '19 at 17:06
Yes, tried analytical function to get the distinct records, both the queries are taking more time – Vijiy Oct 01 '19 at 17:19
try running ANALYZE TABLE Table1 COMPUTE STATISTICS; and then try running your query. Refer this for more information - https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ANALYZETABLE%3Ctable1%3ECACHEMETADATA – sangam.gavini Oct 01 '19 at 18:14
The table is overpartitioned IMO, and partitioning scheme does not match use-case. Full-scan, all partitions files are being read. – leftjoin Feb 12 '20 at 05:58

0 Answers0