rank() 20M records in hive

Question

when i run this sql, it takes about half a hour and it's unacceptable. the source table have 20M records.

each rank() create a mapreduce job running for 7 minutes with all the data in one reduce. how to optimize this sql? can these rank mapreduce run in parallel?

create table tmp.xxxx  as                                       
SELECT t.end_user_id,   
t.pm_net_amt,    
t.parnt_ordr_num,
rank() over(order by t.pm_net_amt) pm_net_amt_rank,
rank() over(order by t.parnt_ordr_num) parnt_ordr_num_rank, 
rank() over(order by t.cat_lvl3_num)  cat_lvl3_num_rank,
rank() over(order by t.discount_rate) discount_rate_rank 
from  bi_tmp.bbbbbb t;

score 0 · Answer 1 · edited May 23 '17 at 12:01

0

From How to make hive run mapreduce jobs concurrently?:

In hive-default.xml, there is a property named "hive.exec.parallel" which could enable execute job in parallel. The default value is "false". You can change it to "true" to acquire this ability. You can use another property "hive.exec.parallel.thread.number" to control how many jobs at most can be executed in parallel.

That may cause each of the ranks to be run in parallel, but I have not tested.

edited May 23 '17 at 12:01

Community

1
1

answered Sep 08 '16 at 13:22

Stephen ODonnell

4,441
17
19

the parameter hive.exec.parallel is already set TRUE and hive.exec.parallel.thread.number = 8 . – Mike Gan Sep 09 '16 at 00:44

rank() 20M records in hive

1 Answers1