
I am trying to save a dataset to S3 using partitionBy in PySpark. I am partitioning by a date column. The Spark job takes more than an hour to execute. If I run the code without partitionBy, it takes just 3-4 minutes. Could somebody help me fine-tune the partitionBy?

us56
  • What's the cardinality of the partition cols? – Robert Beatty Jun 07 '19 at 19:22
  • There are almost 2000 distinct values for the partitioning column, so that means there will be 2000 partitions, with 30-34k records in each partition on average. – us56 Jun 07 '19 at 20:59

2 Answers


OK, so Spark is terrible at doing IO, especially with respect to S3. Currently, when you write in Spark it will use a whole executor to write the data SEQUENTIALLY. That, plus the back and forth between S3 and Spark, leads to it being quite slow. So you can do a few things to help mitigate/sidestep these issues.

  • Use a different partitioning strategy, if possible, with the goal of minimizing the number of files written.
  • If there is a shuffle involved before the write, you can change the setting for the default number of shuffle partitions, spark.sql.shuffle.partitions (200 is the default). You'll probably want to reduce this and/or repartition the data before writing.
  • You can go around Spark's IO and write your own HDFS writer, or use the S3 API directly, e.g. with foreachPartition and a function that writes each partition to S3 (see the first sketch after this list). That way things will write in parallel instead of sequentially.
  • Finally, you may want to use repartition and partitionBy together when writing (DataFrame partitionBy to a single Parquet file (per partition)). Mixed with maxRecordsPerFile (below), this will lead to one file per partition and will help keep your file sizes down (see the second sketch after this list).
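A minimal sketch of that foreachPartition approach, assuming boto3 is available on the executors; df, the bucket name, the key prefix, and the naive CSV serialization are all placeholders rather than anything from the question:

    import io
    import uuid
    import boto3

    def write_partition_to_s3(rows):
        # Serialize one partition in memory, then upload it with a single S3 PUT.
        buf = io.StringIO()
        for row in rows:
            buf.write(",".join(str(v) for v in row) + "\n")
        body = buf.getvalue()
        if body:  # skip empty partitions
            boto3.client("s3").put_object(
                Bucket="my-bucket",                             # placeholder bucket
                Key="output/part-{}.csv".format(uuid.uuid4()),  # placeholder key
                Body=body,
            )

    # Each partition is handled by its own task, so the uploads run in parallel.
    df.foreachPartition(write_partition_to_s3)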
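And a sketch of the repartition-plus-partitionBy pattern, assuming the DataFrame is df and the partition column is date_col (both placeholder names):

    # Reduce shuffle parallelism if a shuffle happens before the write (tune to taste).
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    (df
        .repartition("date_col")               # one in-memory partition per date value
        .write
        .partitionBy("date_col")               # one output directory per date value
        .option("maxRecordsPerFile", 1000000)  # cap rows per file
        .mode("overwrite")
        .parquet("s3a://my-bucket/output/"))   # placeholder output path

Because each date value lands in exactly one in-memory partition, each date directory ends up with a single file (or a few, once maxRecordsPerFile kicks in), instead of one file per task per date.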

As a side note: you can use the option spark.sql.files.maxRecordsPerFile 1000000 to help control file sizes and make sure they don't get out of control.
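If you'd rather set it once for the whole session instead of per write, the equivalent session config is (a sketch; spark is the active SparkSession):

    # Applies to every write in this session; 0 (the default) means no limit.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)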

In short, you should avoid creating too many files, especially small ones. Also note: you will see a big performance hit when you go to read those 2000*n files back in as well.

We use all of the above strategies in different situations, but in general we just try to use a reasonable partitioning strategy plus repartitioning before the write. Another note: if a shuffle is performed, your partitioning is destroyed and Spark's automatic partitioning takes over. Hence the need for the constant repartitioning.

Hope these suggestions help. Spark IO is quite frustrating, but just remember to keep the files read/written to a minimum and you should see fine performance.

Robert Beatty
  • Thanks for these details. Using repartition and partitionBy, is there a way to parallelize the writes for the specific partitions? The data I am trying to partitionBy is heavily skewed, so I end up with a small number of executors that run for a very long time. Even with maxRecordsPerFile, this is the case. If I add another column to repartition, my small partitions end up with many files. So is this a tradeoff between the number of files generated and how much parallelization can be done during the write? – scrayon Jan 11 '20 at 20:43
  • Sounds like you need to salt your data: https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18. The process described in the link involves random salting, but it doesn't have to be random. The basic process involves creating a synthetic partition field, added to your other fields, that deals with skew by splitting up the heavily skewed data. In your case, you can use a window function that creates a new id within each of your other partitions when one grows too large. For instance, if a partition has more than 100,000 records, create a new id in your salting partition field. – Robert Beatty Jan 12 '20 at 23:37

Use version 2 of the FileOutputCommitter:

.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
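In PySpark, one way to apply that setting is through the spark.hadoop. prefix when building the session (a sketch; the app name is a placeholder):

    from pyspark.sql import SparkSession

    # Algorithm version 2 lets tasks commit their output directly instead of
    # waiting for a slow, sequential rename pass at job commit, which helps on S3.
    spark = (SparkSession.builder
        .appName("partitioned-write")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate())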

Philip K. Adetiloye