I have a PySpark DataFrame that goes through several groupBy/pivot-style transformations; once all of them are applied, I write the final DataFrame back as Parquet, partitioned by year. This write takes close to 1.67 hours to execute. I sized the repartition as no. of nodes * no. of cores per node * 1 = 10 * 32 * 1 = 320:
# 320 = 10 nodes * 32 cores per node * 1; repartition returns a new DataFrame, so assign it
df = df.repartition(320)
df.write.partitionBy('year').mode('overwrite').parquet(PATH)
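
For context, here is a minimal, self-contained sketch of the kind of pipeline I mean. The column names (id, year, month, value) and the output path are placeholders, not my real schema:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder input; the real DataFrame comes from upstream tables.
df = spark.createDataFrame(
    [(1, 2020, "jan", 10.0), (1, 2020, "feb", 12.0), (2, 2021, "jan", 7.0)],
    ["id", "year", "month", "value"],
)

# One of the groupBy/pivot transformations described above.
pivoted = df.groupBy("id", "year").pivot("month").agg(F.sum("value"))

# Repartition to nodes * cores * 1 = 10 * 32 * 1 = 320, then write partitioned by year.
final_df = pivoted.repartition(320)
final_df.write.partitionBy("year").mode("overwrite").parquet("/tmp/output")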
I also tried removing the repartition; even then it takes more or less the same time (the variant is sketched below). Quick help is much appreciated!
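
For reference, the no-repartition variant, plus a check on how many partitions Spark is actually writing (getNumPartitions is a standard RDD method; PATH is the same placeholder as above):

# Same write without the explicit repartition; timing was roughly unchanged.
df.write.partitionBy('year').mode('overwrite').parquet(PATH)

# Sanity check: how many partitions (and hence write tasks) Spark uses.
print(df.rdd.getNumPartitions())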