
I have a very simple PySpark program that is supposed to read CSV files from S3:

r = sc.textFile('s3a://some-bucket/some-file.csv') \
    .map(...)  # etc... you know the drill...

This was failing when running on a local Spark node (it works in EMR). I was getting OOM errors and GC crashes. Upon further inspection, I realized that the number of partitions was insanely high. In this particular case r.getNumPartitions() would return 2358041.

I realized that that's exactly the size of my file in bytes. This, of course, makes Spark crash miserably.

I've tried different configurations, like changing mapred.min.split.size:

from datetime import datetime
from pyspark import SparkConf

conf = SparkConf()
conf.setAppName('iRank {}'.format(datetime.now()))
conf.set("mapred.min.split.size", "536870912")
conf.set("mapred.max.split.size", "536870912")
conf.set("mapreduce.input.fileinputformat.split.minsize", "536870912")

I've also tried using repartition and passing a minPartitions argument to textFile, to no avail.
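Roughly, those attempts looked like this (the partition count is just a placeholder):

# Passing a minPartitions hint to textFile -- the partition count stays in the millions.
r = sc.textFile('s3a://some-bucket/some-file.csv', 256)

# Repartitioning after the read -- to no avail either.
r = sc.textFile('s3a://some-bucket/some-file.csv').repartition(256)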

I would love to know what makes Spark think that it's a good idea to derive the number of partitions from the file size.

Cristian

2 Answers


In general it doesn't. As eliasah nicely explained in his answer to Spark RDD default number of partitions, Spark uses the maximum of minPartitions (2 if not provided) and the number of splits computed by the Hadoop input format.
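In other words, minPartitions only sets a lower bound; the Hadoop input format is free to return far more splits. A minimal sketch to illustrate (the path mirrors the question):

# minPartitions is only a lower bound: you get max(minPartitions, hadoop splits).
rdd = sc.textFile("s3a://some-bucket/some-file.csv")                          # at least 2
rdd_hinted = sc.textFile("s3a://some-bucket/some-file.csv", minPartitions=8)  # at least 8

print(rdd.getNumPartitions(), rdd_hinted.getNumPartitions())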

The latter will be unreasonably high only if the configuration instructs it to be, which suggests that some configuration file is interfering with your program.

A possible problem with your code is that you use the wrong configuration. Hadoop options should be set on the hadoopConfiguration, not the Spark configuration. Since you are using Python, you have to go through the private JavaSparkContext instance:

sc = ...  # type: SparkContext

# Set the split sizes (in bytes) on the Hadoop Configuration used by the input format.
sc._jsc.hadoopConfiguration().setInt("mapred.min.split.size", min_value)
sc._jsc.hadoopConfiguration().setInt("mapred.max.split.size", max_value)
  • You can set this using Spark properties too, by using the spark.hadoop prefix. E.g.: `conf.set("spark.hadoop.mapred.min.split.size", "536870912")` `conf.set("spark.hadoop.mapred.max.split.size", "536870912")` – Jelmer Aug 09 '21 at 08:31
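A minimal sketch of the spark.hadoop approach from the comment above (the 512 MB value mirrors the question; Spark copies every spark.hadoop.* property into the Hadoop configuration it creates):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.hadoop.mapred.min.split.size", "536870912")   # 512 MB
        .set("spark.hadoop.mapred.max.split.size", "536870912"))
sc = SparkContext(conf=conf)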
1

There was actually a bug in Hadoop 2.6 which would do this: the initial S3A release didn't provide a block size for Spark to use when computing splits, and the default of "0" meant one-byte splits.

Later versions should all take fs.s3a.block.size as the config option specifying the block size; something like 33554432 (= 32 MB) would be a start.
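For example, something along these lines (a sketch, assuming a Hadoop version with the fix):

# Advertise a 32 MB block size for S3A; the input format then uses it to size splits.
sc._jsc.hadoopConfiguration().set("fs.s3a.block.size", "33554432")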

If you are using Hadoop 2.6.x, don't use S3A. That's my recommendation.

stevel