
Version: Spark 1.6.2, Scala 2.10

I am executing the commands below in the spark-shell, trying to see how many partitions Spark creates by default.

val rdd1 = sc.parallelize(1 to 10)
println(rdd1.getNumPartitions) // ==> Result is 4

//Creating rdd for the local file test1.txt. It is not HDFS.
//File content is just one word "Hello"
val rdd2 = sc.textFile("C:/test1.txt")
println(rdd2.getNumPartitions) // ==> Result is 2

As per the Apache Spark documentation, spark.default.parallelism is the number of cores on my laptop (which has a 2-core processor).

My question is: rdd2 seems to give the correct result of 2 partitions, as stated in the documentation. But why is rdd1 giving 4 partitions?

Sri
  • I just observed that when I try to execute the same (val rdd1 = sc.parallelize(1 to 10)) in my IntelliJ IDE project and fetch the number of partitions, I get 2 partitions as a result. Not sure why the spark-shell gives a different result. – Sri May 27 '17 at 23:05

1 Answer


The minimum number of partitions is actually a lower bound set by the SparkContext. Since Spark uses Hadoop under the hood, the Hadoop InputFormat will still drive the behaviour by default.

The first case should reflect defaultParallelism, as mentioned here, which may differ depending on settings and hardware (number of cores, etc.).

So unless you provide the number of slices, the first case is defined by the value of sc.defaultParallelism:

scala> sc.defaultParallelism
res0: Int = 6

scala> sc.parallelize(1 to 100).partitions.size
res1: Int = 6
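
If you pass numSlices explicitly, that value takes precedence over defaultParallelism (a quick illustration, not from the original session):

scala> sc.parallelize(1 to 100, 4).partitions.size
res2: Int = 4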

As for the second case, with sc.textFile, the number of slices by default is the minimum number of partitions.

That minimum is equal to 2, as you can see in this section of code.
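
Specifically, it comes from SparkContext.defaultMinPartitions, which is also quoted in the comments below:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)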

Thus, you should consider the following:

  • sc.parallelize will take numSlices if you provide it, otherwise defaultParallelism.

  • sc.textFile will take the maximum between minPartitions and the number of splits computed by Hadoop's InputFormat, which is roughly the file size divided by the input split size (the split size itself is capped at the block size; see the sketch after this list).

    • sc.textFile calls sc.hadoopFile, which creates a HadoopRDD that uses InputFormat.getSplits under the hood [Ref. InputFormat documentation].

    • InputSplit[] getSplits(JobConf job, int numSplits) throws IOException: Logically split the set of input files for the job. Each InputSplit is then assigned to an individual Mapper for processing. Note: the split is a logical split of the inputs and the input files are not physically split into chunks. For e.g. a split could be a <input-file-path, start, offset> tuple. Parameters: job – job configuration; numSplits – the desired number of splits, a hint. Returns: an array of InputSplits for the job. Throws: IOException.
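
To make the split computation concrete, here is a minimal sketch of the split-size arithmetic used by the old mapred FileInputFormat.getSplits; the helper name is mine, and the real implementation also allows roughly 10% slop on the last split:

// Simplified model of how FileInputFormat.getSplits sizes its splits:
//   goalSize  = totalSize / numSplits            (numSplits is the hint, i.e. minPartitions)
//   splitSize = max(minSplitSize, min(goalSize, blockSize))
// The number of splits is then roughly totalSize / splitSize.
def expectedSplits(totalSize: Long, numSplits: Int,
                   blockSize: Long, minSplitSize: Long = 1L): Long = {
  val goalSize  = totalSize / math.max(numSplits, 1)
  val splitSize = math.max(minSplitSize, math.min(goalSize, blockSize))
  math.ceil(totalSize.toDouble / splitSize).toLong
}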

Example:

Let's create some dummy text files:

fallocate -l 241m bigfile.txt
fallocate -l 4G hugefile.txt

This will create two files of 241 MB and 4 GB, respectively.

We can see what happens when we read each of the files:

scala> val rdd = sc.textFile("bigfile.txt")
// rdd: org.apache.spark.rdd.RDD[String] = bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27

scala> rdd.getNumPartitions
// res0: Int = 8

scala> val rdd2 = sc.textFile("hugefile.txt")
// rdd2: org.apache.spark.rdd.RDD[String] = hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27

scala> rdd2.getNumPartitions
// res1: Int = 128

Both of them are actually HadoopRDDs:

scala> rdd.toDebugString
// res2: String = 
// (8) bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27 []
//  |  bigfile.txt HadoopRDD[0] at textFile at <console>:27 []

scala> rdd2.toDebugString
// res3: String = 
// (128) hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27 []
//   |   hugefile.txt HadoopRDD[2] at textFile at <console>:27 []
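
Plugging these two files into the expectedSplits sketch above, and assuming the 32 MB local block size (fs.local.block.size) mentioned in the comments, reproduces the observed partition counts:

val mb = 1024L * 1024L
val localBlockSize = 32 * mb                        // assumed fs.local.block.size default

expectedSplits(241 * mb, 2, localBlockSize)         // = 8   (241 / 32, rounded up)
expectedSplits(4L * 1024 * mb, 2, localBlockSize)   // = 128 (4096 / 32)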
eliasah
  • Thanks for your response, but is this behavior the same when reading an HDFS file using sc.textFile? Let's say I am reading a 640MB file in HDFS and the input block size is 64MB. – Sri May 29 '17 at 14:17
  • @Sri textFile uses the Hadoop InputFormat under the hood, so basically yes, it will be reading partitions by input block. – eliasah May 29 '17 at 15:38
  • As per [this section of code](https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329), sc.textFile uses defaultMinPartitions, which is computed as def defaultMinPartitions: Int = math.min(defaultParallelism, 2). So as per the code, sc.textFile should always give 2 partitions when we do not specify the number of partitions while creating the RDD. – Sri May 29 '17 at 22:56
  • I just created an RDD 'rdd3' for the text file test3.txt without specifying the number of partitions. The file size is 241MB. `scala> val rdd3 = sc.textFile("C:/test3.txt") scala> rdd3.getNumPartitions res3: Int = 8` As per [this section of code](https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329) it should always result in 2 partitions, but I see that my rdd3 is created with 8 partitions. Can you please let me know why that is? – Sri May 29 '17 at 23:07
  • Thanks a lot for your detailed explanation. The results are good as per your statement that sc.textFile will take the maximum between minPartitions and the number of splits computed by Hadoop. – Sri May 30 '17 at 14:37
  • But I don't see the Spark code on GitHub [this section of code](https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329) supporting this statement. Looking at the Spark code on GitHub, I see that sc.textFile uses defaultMinPartitions, which is computed as the minimum of (defaultParallelism, 2). Spark code: `def defaultMinPartitions: Int = math.min(defaultParallelism, 2)` – Sri May 30 '17 at 14:37
  • You need to follow the thread of execution, which will lead you to HadoopRDD and then getSplits... – eliasah May 30 '17 at 14:40
  • Sorry can you please explain it more clearly, I did not get you. – Sri May 30 '17 at 15:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/145473/discussion-between-sri-and-eliasah). – Sri May 30 '17 at 15:05
  • I am still not clear. For a 4 GB file and a block size of 128 MB, it should have given 4096/128 = 32 partitions, no? Why did it give 128 partitions? Also, why 8 partitions for the 241 MB file? Should it not be 2 partitions? – vijayinani Apr 28 '18 at 06:59
  • @vijayinani the input split is the unit of division, not the block size – jack AKA karthik Sep 05 '18 at 10:06
  • @eliasah I stumbled upon this thread and got confused with the number of partitions created here for a file sized 241 MB. I recreated files such that each record size is 66KB. For a file sized at 255MB I got 8 partitions, whereas for a file sized at 531MB I got 17 partitions. I'm unable to understand how these partition counts are decided. – Abhash Upadhyaya Nov 11 '18 at 05:54
  • @Abhash Upadhyaya the partition counts are based on the input split size... the input split is 32MB, so 255/32 ~ 8 and 531/32 ~ 17 – Abhinay May 18 '19 at 17:12
  • I am able to get the same result for 'sc.defaultParallelism' and hence can conclude that the value of this parameter is the number of cores of the machine in local mode. But I am not satisfied with the explanation for sc.textFile(".."); it is a bit confusing. – nomadSK25 Oct 19 '19 at 19:01
  • @Sukumaar what is it that confused you? "will take the maximum between minPartitions and the number of splits computed by Hadoop's InputFormat" – eliasah Oct 20 '19 at 07:35