Using sc.binaryFiles() in Spark 2.3.0 on a Hortonworks 2.6.5 server, I ran into default-partitioning behavior in a YARN-managed cluster that I cannot explain. Please see the sample code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object ReadTestYarn extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val sc = new SparkContext("yarn", "ReadTestYarn")

  val inputRDD1 = sc.textFile("hdfs:/user/maria_dev/readtest/input/*")
  val inputRDD2 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")

  println("Num of RDD1 partitions: " + inputRDD1.getNumPartitions)
  println("Num of RDD2 partitions: " + inputRDD2.getNumPartitions)
}
[maria_dev@sandbox-hdp readtest]$ spark-submit --master yarn --deploy-mode client --class ReadTestYarn ReadTest.jar
Num of RDD1 partitions: 10
Num of RDD2 partitions: 1
The data I use is small: 10 CSV files, each about 4-5 MB, 43 MB in total. For RDD1 the resulting number of partitions makes sense, and the calculation method is well explained in the following post and article (a rough sketch of my understanding follows the links):
Spark RDD default number of partitions
https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7
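Based on those, my rough understanding of the RDD1 split arithmetic is sketched below. The minPartitions value and per-file sizes are assumptions for my setup rather than values I have verified, and the calculation ignores the small split slop that FileInputFormat allows:

import scala.math

object TextFileSplitSketch extends App {
  // Assumed values: sc.defaultMinPartitions = 2 on my sandbox,
  // and ten CSV files of roughly 4.3 MB each (~43 MB total).
  val minPartitions = 2
  val fileSizes = Seq.fill(10)(4300000L)
  val totalSize = fileSizes.sum

  // FileInputFormat aims for goalSize bytes per split, but a split never
  // crosses a file boundary, so each small file yields at least one split.
  val goalSize = totalSize / minPartitions // ~21.5 MB
  val splitsPerFile = fileSizes.map(s => math.max(1L, math.ceil(s.toDouble / goalSize).toLong))

  println(s"goalSize = $goalSize bytes, total splits = ${splitsPerFile.sum}") // 10 splits -> 10 partitions
}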
But with RDD2, where binaryFiles() is used and the master URL passed to Spark is "yarn", only one partition is created, and I don't understand exactly why.
@Mark Rajcok gave some explanation in the post below, but the link to the commit changes there no longer works. Could someone please provide a detailed explanation of why only one partition is created in this case?
PySpark: Partitioning while reading a binary file using binaryFiles() function
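For completeness, this is roughly how I have been inspecting the result. The object name is just for illustration, the paths are the same as above, and I am not sure whether the explicit minPartitions argument is even honored here:

import org.apache.spark.SparkContext

object InspectBinaryPartitions extends App {
  val sc = new SparkContext("yarn", "InspectBinaryPartitions")

  // binaryFiles() returns (path, PortableDataStream) pairs, so the file
  // names can be listed per partition to see how they were combined.
  val rdd = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")
  rdd.mapPartitionsWithIndex { (idx, it) =>
    it.map { case (path, _) => s"partition $idx -> $path" }
  }.collect().foreach(println)

  // Asking for an explicit minimum number of partitions, for comparison.
  val rdd10 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*", minPartitions = 10)
  println("Num of RDD2 partitions with minPartitions = 10: " + rdd10.getNumPartitions)
}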