I have a very simple PySpark program that is supposed to read CSV files from S3:

r = sc.textFile('s3a://some-bucket/some-file.csv') \
    .map(etc... you know the drill...)
This was failing when running on a local Spark node (it works in EMR). I was getting OOM errors and GC crashes. Upon further inspection, I realized that the number of partitions was insanely high. In this particular case, r.getNumPartitions() would return 2358041, which is exactly the size of my file in bytes. This, of course, makes Spark crash miserably.
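For context, this is roughly how I'm checking it (the bucket and file names are made up):

r = sc.textFile('s3a://some-bucket/some-file.csv')
# Locally this prints 2358041, the file size in bytes;
# on EMR the same call gives a sensible partition count.
print(r.getNumPartitions())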
I've tried different configurations, like changing mapred.min.split.size:
from datetime import datetime
from pyspark import SparkConf

conf = SparkConf()
conf.setAppName('iRank {}'.format(datetime.now()))
# 536870912 bytes = 512 MB; set both the legacy mapred.* keys and the
# newer mapreduce.* key.
conf.set("mapred.min.split.size", "536870912")
conf.set("mapred.max.split.size", "536870912")
conf.set("mapreduce.input.fileinputformat.split.minsize", "536870912")
I've also tried using repartition or passing a partitions argument to textFile, to no avail.
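Concretely, those attempts looked roughly like this (32 is just an arbitrary target I used while testing):

# Attempt 1: ask textFile for a partition count up front. As far as I can
# tell, minPartitions is only a lower bound, so it may not be able to
# reduce the count anyway.
r = sc.textFile('s3a://some-bucket/some-file.csv', minPartitions=32)

# Attempt 2: read with the defaults, then collapse the partitions. This
# still fails, presumably because the millions of input splits are created
# before the repartition ever runs.
r = sc.textFile('s3a://some-bucket/some-file.csv').repartition(32)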
I would love to know what makes Spark think that it's a good idea to derive the number of partitions from the file size.