
I need to create a Delta Lake file containing more than 150 KPIs. Since there are roughly 150 calculations, we had to create around 60-odd DataFrames, which are then joined into one final DataFrame. This final DataFrame has only around 60k records, but when writing it out as the Delta Lake file, it fails with the error below.

"The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached"

Our cluster configuration is fairly decent: 144 GB of memory and 20 cores.

Is there any solution to overcome this issue? Thanks in advance.

Prakazz
  • Show your transformations. You see that error on the write only because Spark doesn't execute code immediately; everything runs when the action happens – Alex Ott Jun 05 '22 at 08:24
  • @AlexOtt Unfortunately I cannot show you the code, but I can explain how the notebook is structured. I have around 12 cells in the notebook in total, and in the final cell I perform the 'write' action. The preceding cells perform almost every kind of transformation: 'join', 'select', 'agg', 'withColumn', 'withColumnRenamed'. Each cell calculates 10 to 15 columns on average. So, is the accumulation of all these transformations causing the Spark driver to fail? Any solution for this? – Prakazz Jun 05 '22 at 10:14
  • it's hard to say without seeing the code – Alex Ott Jun 05 '22 at 10:25
  • Where I work we have 998 vcores... – thebluephantom Jun 05 '22 at 10:33
  • I had the issue of too many computations on the final write, so I created some intermediate work/temp tables and wrote data into them to offset some of the computation before the final write; you can try doing that – Anjaneya Tripathi Jun 05 '22 at 11:55
  • A simple way to segment the calculation is to cache some of the intermediate Data Frames. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cache.html?highlight=cache#pyspark.sql.DataFrame.cache – David Browne - Microsoft Jun 05 '22 at 15:54
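The two suggestions in the comments above (writing intermediate tables and caching) can be combined. Below is a minimal PySpark sketch, assuming a Databricks notebook where the delta format is available; the table name, path, and column names are placeholders, not taken from the original notebook.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder standing in for one of the ~60 intermediate KPI DataFrames.
base_df = spark.range(0, 60000).withColumn("kpi_1", F.col("id") * 2)

# Option A (intermediate tables): materialise the intermediate result as a
# Delta table so the final write does not recompute the whole lineage.
base_df.write.format("delta").mode("overwrite").saveAsTable("tmp_kpi_stage1")  # placeholder table name
stage1_df = spark.table("tmp_kpi_stage1")

# Option B (cache): keep an intermediate DataFrame in memory/disk for reuse.
cached_df = base_df.cache()
cached_df.count()  # an action is needed to actually populate the cache

# The final write then only has to compute the remaining transformations.
final_df = stage1_df.withColumn("kpi_2", F.col("kpi_1") + 1)
final_df.write.format("delta").mode("overwrite").save("/mnt/delta/kpi_output")  # placeholder path
```

Materialising intermediate tables shortens the query plan the driver has to build for the final write, which is often what helps when the driver crashes on a long chain of transformations.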

1 Answer


There could be a limitation on the number of records that can be processed under your current Azure subscription.

Instead of merging all the records of a file into one at once, first try merging 20-25 records at a time from a single file.

There is a similar question posted on the official Databricks forum which you can refer to, but no solution is available there. Therefore, this looks like a genuine bug.

Possible solution:

Use ...option("treatEmptyValueAsNulls","true").option("maxRowsInMemory",20) in the read options when reading the records and writing them to a single file, as sketched below.
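For illustration only, here is a minimal sketch of where such options would go. The option names are copied verbatim from the suggestion above; they are connector-specific read options rather than Delta writer options, so this assumes the source is read with a connector that supports them (the spark-excel connector, com.crealytics.spark.excel, is used here as an assumed example, and all paths are placeholders).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch only: the option names below come from the suggestion above and are
# connector-specific read options, not Delta writer options. The source
# connector and all paths are assumptions/placeholders.
df = (
    spark.read.format("com.crealytics.spark.excel")   # assumed source connector
    .option("header", "true")
    .option("treatEmptyValueAsNulls", "true")
    .option("maxRowsInMemory", 20)
    .load("/mnt/source/kpi_input.xlsx")                # placeholder path
)

df.write.format("delta").mode("overwrite").save("/mnt/delta/kpi_output")  # placeholder path
```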

Utkarsh Pal