Spark: how do you decide how many partitions to use when repartitioning an RDD? RDD's repartition() takes a number of partitions; how do you come up with that number?
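For context, this is the call I mean, as a minimal sketch (the 200 is just an arbitrary placeholder, not a recommendation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000)
    println(s"before: ${rdd.getNumPartitions} partitions")

    // repartition() triggers a full shuffle into the requested number of partitions;
    // the question is how to pick this number sensibly.
    val repartitioned = rdd.repartition(200)
    println(s"after: ${repartitioned.getNumPartitions} partitions")

    sc.stop()
  }
}
```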
Possible duplicate of [Number of partitions in RDD and performance in Spark](http://stackoverflow.com/questions/35800795/number-of-partitions-in-rdd-and-performance-in-spark) – matfax Mar 12 '17 at 00:04
1 Answer
Rules of thumb when deciding the number of partitions (a small sizing sketch applying these rules follows the list):
A partition should be smaller than 2 GB. This restriction comes from Spark's code: shuffle blocks are backed by byte buffers capped at 2 GB (Integer.MAX_VALUE bytes).
Try to keep the partition size equal to the map split size, i.e. the HDFS default block size (typically 128 MB). Also remember that, unlike MapReduce, in Spark the number of reducer tasks can be greater than or equal to the number of mapper tasks.
If the number of partitions lands around 2000, increase it to more than 2000. Spark applies different logic below and above 2000 partitions: above that threshold it tracks shuffle map output with a more compact data structure (HighlyCompressedMapStatus instead of CompressedMapStatus).
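Here is a minimal sizing sketch that applies the three rules above. The 100 GB input size is an assumed example, and `PartitionSizing`, `numPartitions`, and the constants are hypothetical names for illustration; tune the target size for your own data.

```scala
object PartitionSizing {
  val TargetPartitionBytes: Long = 128L * 1024 * 1024      // rule 2: ~ HDFS default block size
  val MaxPartitionBytes: Long = 2L * 1024 * 1024 * 1024    // rule 1: hard 2 GB limit

  def numPartitions(totalInputBytes: Long): Int = {
    // Rule 2: aim for partitions of roughly one HDFS block each.
    var n = math.max(1L, math.ceil(totalInputBytes.toDouble / TargetPartitionBytes).toLong)

    // Rule 1: never let a single partition reach the 2 GB limit.
    val minForSizeLimit = math.ceil(totalInputBytes.toDouble / MaxPartitionBytes).toLong
    n = math.max(n, minForSizeLimit)

    // Rule 3: if we land just under the 2000 threshold, step past it so Spark
    // switches to the more compact HighlyCompressedMapStatus for shuffles.
    if (n >= 1900 && n <= 2000) n = 2001

    n.toInt
  }

  def main(args: Array[String]): Unit = {
    val inputBytes = 100L * 1024 * 1024 * 1024 // assumed example: a 100 GB dataset
    println(s"suggested partitions: ${numPartitions(inputBytes)}") // 100 GB / 128 MB = 800
  }
}
```

You would pass the result to `rdd.repartition(...)` (or use it as the `numPartitions` argument of wide transformations such as `reduceByKey`).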


– KrazyGautam