
I have a partitioned table in Hive, and I have set the dynamic partitions limit to 2000 per node:

set hive.exec.max.dynamic.partitions.pernode=2000

Now that the 2000 limit is being reached, I am facing a problem, so I am trying to understand whether there is a way to remove this limitation.

Simply put, I don't want to set any value for hive.exec.max.dynamic.partitions.pernode; it should handle any number of partitions.

So could someone please help me on this?

user2531569

2 Answers


I have met this problem before. Add `distribute by partition_column` at the end of the SQL:

insert overwrite table table_A partition (date_id)
select xxxxxxxx
from table_B
distribute by date_id;

With `distribute by`, rows with the same date_id value are shuffled into the same reducer, so a reducer processes a few specific date_id values instead of a random mix (which could include all date_id values).
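For completeness, a fuller sketch of this pattern (table_A, table_B, and date_id as in the answer; col1 and col2 are hypothetical column names, and the set statements are illustrative):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table table_A partition (date_id)
select col1, col2, date_id      -- the dynamic partition column must come last
from table_B
distribute by date_id;          -- route all rows for a given date_id to one reducer
```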

Mike Gan
  • How is `distribute by` connected with the dynamic partitions limit? Whether you distribute or not, the number of partitions dynamically created will be the same. `distribute by` can help to reduce memory consumption. – leftjoin Sep 26 '16 at 08:13
  • `distribute by partition_column` redistributes the data so that rows with the same partition value are shuffled into one reducer. – Mike Gan Sep 26 '16 at 08:25
  • This will not change the number of partitions per node. – leftjoin Sep 26 '16 at 08:36
  • It's too hard for me to explain this in English; maybe you can write a sample SQL. – Mike Gan Sep 26 '16 at 08:50
  • If your data contains 15000 different partition keys, how many partitions will be created after load? With `distribute by` and without it? – leftjoin Sep 26 '16 at 09:17
  • Both will create 15000 partitions, but without `distribute by`, partition_A may be created by 100 nodes: each node generates 1% of the data, and each of those nodes counts +1 partition against hive.exec.max.dynamic.partitions.pernode. With `distribute by`, all data for partition_A is processed by one node, and that node only handles a certain data volume, so this reduces the number of partitions created per node. – Mike Gan Sep 26 '16 at 09:43
  • Thanks for your explanation! Finally I have found the same in the documentation: https://cwiki.apache.org/confluence/display/Hive/Tutorial – leftjoin Sep 26 '16 at 11:08
  • @MikeGan: Can you add this explanation from the comment into the answer, so that others can read it in the answer at a glance rather than expanding the comments? – K S Nidhin Sep 26 '16 at 19:29
  • Yes, please edit your answer and I will undo my downvote. – leftjoin Sep 29 '16 at 19:32
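The effect Mike Gan describes in the comments can be illustrated with a small simulation (the reducer count, key count, and row count here are hypothetical): without `distribute by`, rows for a given partition key land on many reducers, so every reducer ends up creating directories for almost every key; hashing by the partition key, which is what `distribute by` does, sends each key to exactly one reducer.

```python
import random
from collections import defaultdict

random.seed(0)

NUM_REDUCERS = 10
# Hypothetical workload: 30 distinct date_id partition keys, 10,000 rows.
PARTITION_KEYS = [f"2016-09-{d:02d}" for d in range(1, 31)]
rows = [random.choice(PARTITION_KEYS) for _ in range(10_000)]

# Without distribute by: rows arrive at reducers in effectively arbitrary
# order, so each reducer sees (and must create) almost every partition.
random_assign = defaultdict(set)
for i, key in enumerate(rows):
    random_assign[i % NUM_REDUCERS].add(key)

# With distribute by date_id: rows are routed by a hash of the partition
# key, so each key goes to exactly one reducer.
hash_assign = defaultdict(set)
for key in rows:
    hash_assign[hash(key) % NUM_REDUCERS].add(key)

max_without = max(len(keys) for keys in random_assign.values())
max_with = max(len(keys) for keys in hash_assign.values())
print("max partitions per reducer without distribute by:", max_without)
print("max partitions per reducer with distribute by:   ", max_with)
```

Both runs create the same 30 partitions in total, but the per-reducer (and hence per-node) count is much lower with hashing, which is why `distribute by` helps stay under hive.exec.max.dynamic.partitions.pernode.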

As far as I know, that can't be done: Hive enforces a limit on the number of dynamic partitions it can create. To my knowledge, this limitation exists because each partition is stored in a separate HDFS directory, so the total number is bounded in advance to prevent performance issues.
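If the cap cannot be removed, the usual workaround is to raise the related limits explicitly before running the insert. A sketch with illustrative values (tune them to your cluster; these settings raise, but do not remove, Hive's dynamic-partition safety limits):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=20000;        -- total across the whole job
set hive.exec.max.dynamic.partitions.pernode=5000; -- per mapper/reducer node
set hive.exec.max.created.files=200000;            -- total HDFS files the job may create
```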

Jaime Caffarel