
I have an input file of 849MB. When I read this file in the pyspark shell using sc.textFile() and check the number of partitions, it is 27. I have another file of 2.60GB, and for that file the number of partitions is 84. It seems as if a dfs.block.size of 32MB would explain both of these values. I am running locally with 4 cores.
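Roughly what I ran in the shell (the file paths here are just placeholders):

```python
# pyspark shell, local mode with 4 cores; paths are placeholders
rdd_small = sc.textFile("/data/input_849mb.txt")
print(rdd_small.getNumPartitions())   # prints 27

rdd_large = sc.textFile("/data/input_2_6gb.txt")
print(rdd_large.getNumPartitions())   # prints 84
```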

But when I checked dfs.block.size, it was 128MB. I don't understand how my pyspark shell is calculating the number of partitions.

1 Answer


The numbers look correct; don't forget that the number of cores is also a factor here: you have 4 cores, so 128/4 = 32.
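A quick back-of-the-envelope check of that, using the file sizes from your question (treating the partition count as file size divided by split size, rounded up):

```python
import math

block_size_mb = 128        # dfs.block.size reported in the question
cores = 4                  # local mode with 4 cores
split_size_mb = block_size_mb / cores    # 32 MB

print(math.ceil(849 / split_size_mb))          # 27 partitions for the 849 MB file
print(math.ceil(2.60 * 1024 / split_size_mb))  # 84 partitions for the 2.60 GB file
```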

pltc
  • But if you calculate the splitSize using the formula in this [post](https://stackoverflow.com/questions/69715907/understanding-the-number-of-partitions-created-by-spark), the splitSize is 128MB, and the number of partitions should be calculated from that (see the sketch after these comments). – Shanif Ansari Apr 11 '22 at 17:28
  • That's funny you reference my post :) However, [`sc.textFile`](https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L923) uses a [different formula](https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2519) to determine the number of partitions (compare with `csv` or `parquet`). – pltc Apr 12 '22 at 04:29
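For comparison, here is a rough Python transcription of the Hadoop `FileInputFormat` split-size logic that `sc.textFile` ends up going through, simplified (the 1.1 split slop is ignored, `minSize` is assumed to be the default of 1 byte, and `min_partitions=2` stands in for `sc.defaultMinPartitions`, which is `min(defaultParallelism, 2)` and therefore 2 with 4 local cores):

```python
import math

def estimated_partitions(total_size, block_size, min_partitions=2, min_size=1):
    # goalSize = totalSize / numSplits; splitSize = max(minSize, min(goalSize, blockSize))
    goal_size = total_size // min_partitions
    split_size = max(min_size, min(goal_size, block_size))
    return math.ceil(total_size / split_size)

MB = 1024 * 1024
print(estimated_partitions(849 * MB, 128 * MB))  # 7  -- what a 128 MB split size would give
print(estimated_partitions(849 * MB, 32 * MB))   # 27 -- matches the observed count
```

The observed 27 and 84 line up with a 32MB split size, as the question notes, rather than with the 128MB dfs.block.size.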