
How do you decide how many partitions to use when repartitioning an RDD in Spark? `RDD.repartition()` takes the number of partitions as an argument; how do you come up with that number?

    Possible duplicate of [Number of partitions in RDD and performance in Spark](http://stackoverflow.com/questions/35800795/number-of-partitions-in-rdd-and-performance-in-spark) – matfax Mar 12 '17 at 00:04

1 Answer


Rules of thumb when deciding the number of partitions (a sizing sketch follows the list):

  1. A partition should be smaller than 2 GB. This restriction comes from Spark's code: blocks are backed by byte buffers, which are capped at 2 GB.

  2. Try to keep the partition size equal to the map split size, i.e. the HDFS default block size (typically 128 MB). Also remember that, unlike MapReduce, in Spark the number of reducer tasks can be >= the number of mapper tasks.

  3. If the number of partitions comes out close to 2000, increase numPartitions to above 2000, because Spark applies different logic below and above the 2000-partition mark (above 2000 partitions it switches to a more compact representation of map output statuses, HighlyCompressedMapStatus).
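A minimal sketch of applying these rules, assuming you can estimate the total input size up front (the size figures and the `PartitionSizing` name are hypothetical; in practice you might get the byte count from `FileSystem.getContentSummary` on the input path):

```scala
import org.apache.spark.sql.SparkSession

object PartitionSizing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-sizing-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical figures: total input size and the target size per
    // partition (rule 2: aim for roughly the HDFS block size).
    val totalInputBytes = 64L * 1024 * 1024 * 1024  // assume 64 GB of input
    val targetPartitionBytes = 128L * 1024 * 1024   // ~HDFS default block size

    // Rules 1 and 2: size partitions near the block size, far below 2 GB.
    var numPartitions =
      math.ceil(totalInputBytes.toDouble / targetPartitionBytes).toInt

    // Rule 3: don't sit right at the 2000 boundary; push past it.
    if (numPartitions >= 1900 && numPartitions <= 2000) numPartitions = 2001

    // Stand-in dataset so the sketch runs locally; in practice this
    // would be something like sc.textFile("hdfs:///...").
    val rdd = sc.parallelize(1 to 1000000)
    val repartitioned = rdd.repartition(numPartitions)
    println(s"numPartitions = $numPartitions, " +
      s"actual = ${repartitioned.getNumPartitions}")

    spark.stop()
  }
}
```

With the hypothetical 64 GB input above, this yields 512 partitions of roughly 128 MB each, comfortably under the 2 GB block limit.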

eliasah
KrazyGautam