
I have a PySpark DataFrame that goes through several groupBy/pivot-style transformations. After applying all of them, I write the final DataFrame back as Parquet, partitioning the output.

This takes close to 1.67 hours to execute. The repartition count was chosen as no. of nodes * no. of cores per node * 1 = 10 * 32 * 1 = 320:

df = df.repartition(320)

df.write.partitionBy('year').mode('overwrite').parquet(PATH)

I also tried removing the repartition, and even then it takes more or less the same time. Quick help is much appreciated!

Raja Sabarish PV
  • Could you add the entire code snippet rather than a couple of LOC? Also, what's the cluster configuration, was any executor tuning done, what's the input data size, etc.? – Rohit Anil Jun 23 '23 at 14:44
  • @RohitAnil Assume ```df = dft.join(df2)```, then ```interim_df = df.groupBy("col")```, then ```final_df = df.join(interim_df)```, and at the end ```df.write.partitionBy('year').mode('overwrite').parquet(PATH)``` – Raja Sabarish PV Jun 23 '23 at 18:09
  • df = df.repartition(320, "year") – parisni Jun 23 '23 at 19:48
  • I see that you are grouping df by a column but with no aggregation function, and then joining the output with the same df that you used for the grouping. I'm kind of confused as to why this is being done. – Rohit Anil Jun 23 '23 at 21:31
  • @parisni it didn't help. – Raja Sabarish PV Jun 26 '23 at 14:39
  • @RohitAnil I am doing a groupBy with agg. Can I use a window function? – Raja Sabarish PV Jun 26 '23 at 14:40
  • Depends on what you are trying to achieve here. If you could reframe your question with some dummy data, the code you have written, and what you are trying to achieve, it's easier to help. – Rohit Anil Jun 26 '23 at 16:52

1 Answer


Some suggestions would be (rough sketches for several of these follow the list):

  1. Use persist/cache on the DataFrame that is used in multiple places. Based on the comment, df can be persisted/cached.
  2. Check whether the join key is skewed. If it is, use a salted key for the join, and consider broadcasting if one of the DataFrames is small.
  3. Instead of the default executor configuration, try executor tuning: number of cores per executor, executor instances, executor memory, shuffle partitions, etc.
  4. Use coalesce instead of repartition, since coalesce avoids a full shuffle.
  5. Based on the size of the data, choose a cluster size and instance type, and perform executor tuning accordingly.
  6. Check whether the input data is a single compressed file. If it is, that alone can be a bottleneck.
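For point 1, a minimal sketch of persisting the reused DataFrame, reusing the names from the comments (dft, df2, PATH, and the join/aggregation columns are placeholders, not your exact code):

```python
from pyspark import StorageLevel
from pyspark.sql import functions as F

# df is reused for both the aggregation and the later join, so persist it once
df = dft.join(df2, on="key")                    # "key" is a placeholder join column
df = df.persist(StorageLevel.MEMORY_AND_DISK)   # or simply df.cache()

interim_df = df.groupBy("col").agg(F.count("*").alias("cnt"))
final_df = df.join(interim_df, on="col")

final_df.write.partitionBy("year").mode("overwrite").parquet(PATH)
df.unpersist()                                  # release the cache once the write is done
```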
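For point 2, a sketch of broadcasting the smaller side and, if the key is skewed, salting it. The salt bucket count and column names are assumptions, and spark is the active SparkSession:

```python
from pyspark.sql import functions as F

# If interim_df is small, broadcast it so the join avoids shuffling the big df
final_df = df.join(F.broadcast(interim_df), on="col")

# If "col" is heavily skewed instead, spread hot keys over N salt buckets (N = 16 is arbitrary)
N = 16
df_salted = df.withColumn("salt", (F.rand() * N).cast("int"))
interim_salted = interim_df.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt")   # replicate the small side N times
)
final_df = df_salted.join(interim_salted, on=["col", "salt"]).drop("salt")
```

On Spark 3.x you can also enable spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled so AQE splits skewed partitions automatically.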
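For point 3, one illustrative way to set executor resources from code for a 10-node, 32-cores-per-node cluster. The specific numbers are assumptions to be tuned against your workload, not recommendations (on YARN these are often passed via spark-submit instead):

```python
from pyspark.sql import SparkSession

# Illustrative sizing for 10 nodes x 32 cores: ~6 executors per node with 5 cores each;
# executor memory depends on the node RAM, so 16g is only a placeholder
spark = (
    SparkSession.builder
    .appName("groupby_pivot_job")
    .config("spark.executor.instances", "60")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "16g")
    .config("spark.sql.shuffle.partitions", "320")   # instead of the default 200
    .getOrCreate()
)
```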
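For point 4, a sketch of the write step using coalesce; the target partition count is arbitrary, and the repartition-by-column variant from the comments is shown for comparison:

```python
# coalesce only merges existing partitions (it cannot increase their number),
# so it avoids the full shuffle that repartition(320) triggers
final_df.coalesce(64).write.partitionBy("year").mode("overwrite").parquet(PATH)

# Alternative from the comments: repartition by the partition column so each year's
# rows are grouped together and partitionBy writes fewer, larger files per year
# final_df.repartition("year").write.partitionBy("year").mode("overwrite").parquet(PATH)
```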
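For point 6, a gzip-compressed text file is not splittable, so a single input file is read by one task. A common workaround (the path below is a placeholder) is to repartition right after the read:

```python
# A single non-splittable file (e.g. .csv.gz) is read by one task
raw = spark.read.csv("s3://bucket/input/data.csv.gz", header=True)
print(raw.rdd.getNumPartitions())   # typically 1 for a single gzip file

# Spread the rows across the cluster before the heavy groupBy/pivot work
raw = raw.repartition(320)
```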
Rohit Anil