1

I have an 8.9 GB text file, and I create an RDD out of it by loading it into Spark:

textfile = sc.textFile("input.txt")

The number of partitions that Spark creates is 279, which it obtains by dividing the size of the input file by the 32 MB default HDFS block size. I can pass an argument to textFile and ask for more partitions; unfortunately, however, I cannot get fewer partitions than this default (e.g., 4).

If I pass 4 as an argument, Spark ignores it and proceeds with 279 partitions.
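
To make the behaviour concrete, this is roughly what I am seeing (a sketch; the counts are the ones from my run described above):

textfile = sc.textFile("input.txt")
textfile.getNumPartitions()                      # 279

textfile = sc.textFile("input.txt", minPartitions=4)
textfile.getNumPartitions()                      # still 279; minPartitions can only raise the count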

Since my underlying filesystem is not HDFS, it seems very inefficient to me to split the input into this many partitions. How can I force Spark to use fewer partitions? How can I make Spark use a larger value than the default HDFS block size?

MPAK
  • Maybe you can adjust the performance of your application by changing the level of parallelism. The default partitioner reads `spark.default.parallelism`. Try adjusting your partitions by setting the `spark.default.parallelism` property in the config file. The definition of this property is: *Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.* (A minimal sketch of setting this property follows after these comments.) – Roxana Roman Aug 07 '15 at 19:25
  • Thanks for your reply. With default parallelism you cannot go below the number of default partitions created by Spark. You can use that parameter to increase the number of partitions beyond the default value. – MPAK Aug 11 '15 at 23:08
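
For reference, setting that property from code looks roughly like this (a sketch, not from either commenter; as the reply above notes, it does not reduce the number of splits that sc.textFile creates):

from pyspark import SparkConf, SparkContext

# Sketch: spark.default.parallelism applies to shuffles and parallelize(),
# but the split count of sc.textFile() is still driven by the 32 MB block size.
conf = SparkConf().setAppName("parallelism-example").set("spark.default.parallelism", "4")
sc = SparkContext(conf=conf)

sc.parallelize(range(100)).getNumPartitions()   # 4
sc.textFile("input.txt").getNumPartitions()     # still 279 for the 8.9 GB file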

3 Answers

0

In your case, since the block size is 32 MB, you are getting 279 partitions. You can increase the block size in your HDFS to any other value that suits your requirement. The block size parameter is in hdfs-site.xml: it is `dfs.blocksize` on recent Hadoop versions (`dfs.block.size` on older ones) and takes a value in bytes, e.g. 134217728 for 128 MB.

Metadata
0

I have encountered the same problem. I tried changing the following settings:

conf.set("spark.hadoop.dfs.block.size", str(min_block_size))
conf.set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", str(min_block_size))
conf.set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", str(max_block_size))

None of them actually changed the input split size; it stayed at 32 MB. Then I realised that I am using a local file system, not HDFS, so that is probably why they had no effect. I found another configuration that is supposed to work with local files (I think), shown below.

# The maximum number of bytes to pack into a single partition when reading files.
conf.set("spark.files.maxPartitionBytes", str(min_block_size)) 

However, it had no effect at all. I tried one more configuration change by adding the following:

conf.set("spark.sql.files.maxPartitionBytes", str(sql_block_size))

It changed the input split size for DataFrames, but not for RDDs :(.

If anyone has found a configuration that actually changes the input split size for RDDs, I would appreciate an answer.
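
For what it's worth, here is a sketch of the DataFrame route that did change the split size for me, going through spark.read.text and then dropping down to an RDD of lines afterwards. The 512 MB value is only an example, and the partitioning of df.rdd carrying over is my observation rather than a documented guarantee:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fewer-partitions")
         .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))  # 512 MB splits
         .getOrCreate())

df = spark.read.text("input.txt")           # single string column named "value"
print(df.rdd.getNumPartitions())            # far fewer partitions than sc.textFile gave
lines = df.rdd.map(lambda row: row.value)   # back to an RDD of plain strings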

0

I've also tried most of these configurations; in the end, what worked for me was repartition():

textfile = sc.textFile("input.txt").repartition(2)
textfile.getNumPartitions
# result
2
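
A side note (mine, not from the original answer): if the full shuffle that repartition() triggers is undesirable, coalesce() can also reduce the partition count without one:

textfile = sc.textFile("input.txt").coalesce(2)
textfile.getNumPartitions()   # 2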
myeongkil kim