I have an 8.9 GB text file, and I create an RDD from it in Spark:
textfile = sc.textFile("input.txt")
The number of partitions Spark creates is 279, which corresponds to the size of the input file divided by the default 32 MB HDFS block size. I can pass an argument to textFile to ask for more partitions, but unfortunately I cannot get fewer partitions than this default (e.g., 4).
If I pass 4 as the argument, Spark ignores it and proceeds with 279 partitions.
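To illustrate, this is roughly what I am calling (assuming the second positional argument of textFile is minPartitions) and how I check the resulting partition count:

textfile = sc.textFile("input.txt", 4)
print(textfile.getNumPartitions())  # still reports 279, not 4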
Since my underlying filesystem is not HDFS, splitting the input into this many partitions seems very inefficient to me. How can I force Spark to use fewer partitions? And how can I change the default HDFS block size that Spark assumes to a larger value?
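To make the question concrete: is forcing a larger split size through the Hadoop input format the right direction? Below is a minimal sketch of what I mean; the 512 MB value is arbitrary, and the use of newAPIHadoopFile with the split-minsize setting is just my guess at what is needed rather than something I know to work.

# Hypothetical sketch: ask the Hadoop input format for a 512 MB minimum split size
conf = {"mapreduce.input.fileinputformat.split.minsize": str(512 * 1024 * 1024)}
lines = sc.newAPIHadoopFile(
    "input.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
).map(lambda kv: kv[1])  # keep only the line text, dropping the byte-offset key
print(lines.getNumPartitions())  # hoping for far fewer than 279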