Using sc.binaryFiles() in Spark 2.3.0 on a Hortonworks 2.6.5 server, I ran into default-partitioning behavior in a YARN-managed cluster that I cannot explain. Please see the sample code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object ReadTestYarn extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val sc = new SparkContext("yarn", "ReadTestYarn")

  val inputRDD1 = sc.textFile("hdfs:/user/maria_dev/readtest/input/*")
  val inputRDD2 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")

  println("Num of RDD1 partitions: " + inputRDD1.getNumPartitions)
  println("Num of RDD2 partitions: " + inputRDD2.getNumPartitions)
}
[maria_dev@sandbox-hdp readtest]$ spark-submit --master yarn --deploy-mode client --class ReadTestYarn ReadTest.jar
Num of RDD1 partitions: 10
Num of RDD2 partitions: 1
The data I use is small: 10 CSV files, each about 4-5 MB, 43 MB in total. For RDD1 the resulting number of partitions makes sense, and the calculation method is well explained in the following post and article (a rough sketch of my understanding follows the links):
Spark RDD default number of partitions
https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7
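Based on those, my rough understanding of the RDD1 split arithmetic is sketched below. The minPartitions value and per-file sizes are assumptions for my setup rather than values I have verified, and the calculation ignores the small split slop that FileInputFormat allows:

import scala.math

object TextFileSplitSketch extends App {
  // Assumed values: sc.defaultMinPartitions = 2 on my sandbox,
  // and ten CSV files of roughly 4.3 MB each (~43 MB total).
  val minPartitions = 2
  val fileSizes = Seq.fill(10)(4300000L)
  val totalSize = fileSizes.sum

  // FileInputFormat aims for goalSize bytes per split, but a split never
  // crosses a file boundary, so each small file yields at least one split.
  val goalSize = totalSize / minPartitions // ~21.5 MB
  val splitsPerFile = fileSizes.map(s => math.max(1L, math.ceil(s.toDouble / goalSize).toLong))

  println(s"goalSize = $goalSize bytes, total splits = ${splitsPerFile.sum}") // 10 splits -> 10 partitions
}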
But with RDD2, where binaryFiles() is used and the master URL passed to Spark is "yarn", only one partition is created, and I don't understand exactly why.
@Mark Rajcok gave some explanation in the post below, but the link to the commit changes there no longer works. Could someone please provide a detailed explanation of why only one partition is created in this case?
PySpark: Partitioning while reading a binary file using binaryFiles() function
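For completeness, this is roughly how I have been inspecting the result. The object name is just for illustration, the paths are the same as above, and I am not sure whether the explicit minPartitions argument is even honored here:

import org.apache.spark.SparkContext

object InspectBinaryPartitions extends App {
  val sc = new SparkContext("yarn", "InspectBinaryPartitions")

  // binaryFiles() returns (path, PortableDataStream) pairs, so the file
  // names can be listed per partition to see how they were combined.
  val rdd = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")
  rdd.mapPartitionsWithIndex { (idx, it) =>
    it.map { case (path, _) => s"partition $idx -> $path" }
  }.collect().foreach(println)

  // Asking for an explicit minimum number of partitions, for comparison.
  val rdd10 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*", minPartitions = 10)
  println("Num of RDD2 partitions with minPartitions = 10: " + rdd10.getNumPartitions)
}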