
I have a DataFrame with 120 million records, and writing it takes 45 minutes. The DataFrame is partitioned by a field "id_date" in the format yyyymmdd, and it is written as a Delta table in Databricks. I have tried auto-optimize, compaction, etc. from the Delta table properties, but I don't see any improvement. Sorry for my bad English. Can anyone help? Thanks and regards

I am trying to write a DataFrame with many records (more than 100 million) and it takes a long time. I need to optimize this process.
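
Roughly, the write looks like this (a minimal sketch; the target path is a placeholder, not the real one):

```python
# Minimal sketch of the write (the path is a placeholder):
# repartition on the Delta partition column first, so each task only
# touches a small number of id_date partitions, then write the Delta table.
(
    df.repartition("id_date")           # shuffle keyed by the partition column
      .write
      .format("delta")
      .mode("overwrite")
      .partitionBy("id_date")           # physical partitioning, yyyymmdd values
      .save("/mnt/datalake/my_table")   # placeholder target path
)
```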

  • Did you try `.repartition(N)` before writing? – Islam Elbanna Jun 05 '23 at 11:33
  • No, do you think repartitioning would help? In that case, should the value of N be taken from `df.rdd.getNumPartitions()`? Thanks – tempo Jun 05 '23 at 11:44
  • Increasing the number of partitions increases the parallelism of processing/saving the data, but you need to watch the resource limits so it doesn't become overhead. Don't use `df.rdd.getNumPartitions()`, since that would have no effect; N is usually the number of cores in the cluster times a factor of 2 or 3, so if you have 10 cores you can try setting it to 20 or 30. – Islam Elbanna Jun 05 '23 at 12:08
  • I have tried this, but it still takes a very long time. Here you can see the Spark UI: https://drive.google.com/file/d/16Zu-bKe_WBr-00k1LGHGLWx6LqY-j3xu/view?usp=sharing In executor 2 the record count keeps increasing, up to 16 million in this case (in the image it is at 2 million), but it is all on the same executor, so it takes a long time. Any suggestions? Thanks again – tempo Jun 05 '23 at 13:26
  • Could you try adding the partition column as well, `.repartition(N, $"id_date")`? – Islam Elbanna Jun 05 '23 at 13:30
  • Yes, that Spark UI image is from this command: `display(df_join_tornos.repartition(4, "id_fecha").count())` – tempo Jun 05 '23 at 13:43
  • Could it be that the write operation is not the problem, but rather that Spark is spending the time processing the transformations? Could you share the execution plan of your DataFrame? https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.explain.html – Bernard Jesop Jun 11 '23 at 13:37
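
Putting the suggestions from the comments together, a minimal sketch (the core count and factor are assumptions to adjust for your cluster; `df` stands for the DataFrame being written):

```python
# Sketch only: the numbers below are assumptions, not measured values.
num_cores = 10                                    # e.g. total executor cores in the cluster
factor = 3                                        # 2-3x the core count, per the comments

# Repartition by both a target partition count and the Delta partition column,
# so the work is spread across executors instead of piling up on one.
df_out = df.repartition(num_cores * factor, "id_date")

# Check whether the time is really spent in the write or in upstream
# transformations (joins, wide shuffles) by inspecting the physical plan.
df_out.explain(mode="formatted")
```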

0 Answers