0

when i run this sql, it takes about half a hour and it's unacceptable. the source table have 20M records.

each rank() create a mapreduce job running for 7 minutes with all the data in one reduce. how to optimize this sql? can these rank mapreduce run in parallel?

create table tmp.xxxx  as                                       
SELECT t.end_user_id,   
t.pm_net_amt,    
t.parnt_ordr_num,
rank() over(order by t.pm_net_amt) pm_net_amt_rank,
rank() over(order by t.parnt_ordr_num) parnt_ordr_num_rank, 
rank() over(order by t.cat_lvl3_num)  cat_lvl3_num_rank,
rank() over(order by t.discount_rate) discount_rate_rank 
from  bi_tmp.bbbbbb t; 
Mike Gan
  • 339
  • 2
  • 8

1 Answers1

0

From How to make hive run mapreduce jobs concurrently?:

In hive-default.xml, there is a property named "hive.exec.parallel" which could enable execute job in parallel. The default value is "false". You can change it to "true" to acquire this ability. You can use another property "hive.exec.parallel.thread.number" to control how many jobs at most can be executed in parallel.

That may cause each of the ranks to be run in parallel, but I have not tested.

Community
  • 1
  • 1
Stephen ODonnell
  • 4,441
  • 17
  • 19