
I have a DataFrame with 120 million records, and writing it takes 45 minutes. The DataFrame is partitioned by a field "id_date" in the format yyyymmdd, and it is written as a Delta table in Databricks. I have tried auto-optimize, compaction, etc. from the Delta table properties, but I don't see any improvement. Sorry for my bad English. Can anyone help? Thanks and regards

I am trying to write a DataFrame with many records (more than 100 million) and it takes a long time. I need to optimize this process.
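
Roughly, the write looks like this (a minimal sketch; the target path is a placeholder, not the real one):

```python
# Minimal sketch of the write (the path is a placeholder):
# repartition on the Delta partition column first, so each task only
# touches a small number of id_date partitions, then write the Delta table.
(
    df.repartition("id_date")           # shuffle keyed by the partition column
      .write
      .format("delta")
      .mode("overwrite")
      .partitionBy("id_date")           # physical partitioning, yyyymmdd values
      .save("/mnt/datalake/my_table")   # placeholder target path
)
```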

  • Did you try `.repartition(N)` before writing? – Islam Elbanna Jun 05 '23 at 11:33
  • No, do you think repartitioning would help? In that case, should the value of N be taken from `df.rdd.getNumPartitions()`? Thanks – tempo Jun 05 '23 at 11:44
  • Increasing the number of partitions increases the parallelism of processing/saving the data, but you need to watch the resource limits so it doesn't become overhead. Don't use `df.rdd.getNumPartitions()`, since that would have no effect; N is usually the number of cores in the cluster times a factor of 2 or 3, so if you have 10 cores you can try setting it to 20 or 30. – Islam Elbanna Jun 05 '23 at 12:08
  • I have tried this, but it still takes a very long time. Here you can see the Spark UI: https://drive.google.com/file/d/16Zu-bKe_WBr-00k1LGHGLWx6LqY-j3xu/view?usp=sharing In executor 2 the record count keeps increasing, up to 16 million in this case (in the image it is at 2 million), but it is all on the same executor, so it takes a long time. Any suggestions? Thanks again – tempo Jun 05 '23 at 13:26
  • Could you try adding the partition column as well, `.repartition(N, $"id_date")`? – Islam Elbanna Jun 05 '23 at 13:30
  • Yes, that Spark UI image is from this command: `display(df_join_tornos.repartition(4, "id_fecha").count())` – tempo Jun 05 '23 at 13:43
  • Could it be that the write operation is not the problem, but rather that Spark is spending the time processing the transformations? Could you share the execution plan of your DataFrame? https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.explain.html – Bernard Jesop Jun 11 '23 at 13:37
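
Putting the suggestions from the comments together, a minimal sketch (the core count and factor are assumptions to adjust for your cluster; `df` stands for the DataFrame being written):

```python
# Sketch only: the numbers below are assumptions, not measured values.
num_cores = 10                                    # e.g. total executor cores in the cluster
factor = 3                                        # 2-3x the core count, per the comments

# Repartition by both a target partition count and the Delta partition column,
# so the work is spread across executors instead of piling up on one.
df_out = df.repartition(num_cores * factor, "id_date")

# Check whether the time is really spent in the write or in upstream
# transformations (joins, wide shuffles) by inspecting the physical plan.
df_out.explain(mode="formatted")
```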

0 Answers