
Problem Statement: The PySpark program hangs when reading records from a dataframe based on a condition that a particular field is NOT NULL. The field is a string field that may or may not contain a value. Any operation on this string field, such as checking it for NULL or calculating its length, causes the code to hang and then terminate.
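For illustration, here is a minimal sketch of the kind of check that triggered the hang (the SparkSession setup and sample data are assumptions for the example; 'ErDesc' is the column described below):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, length

    spark = SparkSession.builder.appName("NullCheckExample").getOrCreate()
    df = spark.createDataFrame([("1", None), ("2", "some error")],
                               ["id", "ErDesc"])

    # Either of these operations on the string column could trigger the hang:
    valid_df = df.filter(col("ErDesc").isNull())    # NULL check
    lengths = df.select(length(col("ErDesc")))      # length calculation
    valid_df.show()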

Description: In our case, the PySpark program reads data from a file and loads it into a dataframe. A new column named 'ErDesc' is added to the dataframe. This field is dynamically populated with comma-separated error descriptions whenever data validation fails for any field of a record. At the end of all the checks, the dataframe is read to identify the records where 'ErDesc' is NULL (the valid records). Sometimes this step completes successfully, and sometimes the program hangs and then terminates.
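To make the setup concrete, here is a sketch of how such a validation pass might populate 'ErDesc' (the 'amount' column, the validation rule, and the message are illustrative assumptions, not the original code):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat_ws, lit, when

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, -5), (2, 10)], ["id", "amount"])
    df = df.withColumn("ErDesc", lit(""))

    # Append a description when a validation rule fails; a real version
    # would also trim the leading separator left by the empty initial value
    df = df.withColumn(
        "ErDesc",
        when(col("amount") < 0,
             concat_ws(",", col("ErDesc"), lit("amount must be non-negative")))
        .otherwise(col("ErDesc")))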

What we did so far: We tried to solve this problem by defining the initial value of 'ErDesc' as '' or "" instead of NULL. However, after running the records through all the data validations, whenever we checked the dataframe for 'ErDesc' equal to '' or "" or NULL, the process hung and terminated. The confusing part was that the records were processed through multiple iterations: for the first two iterations the check on 'ErDesc' worked fine, but on the next iteration the process would hang and then terminate. We modified the code to skip that iteration and continue with the next one. Again the code completed the first two iterations, skipped the third, executed the fourth successfully, and then hung in the fifth iteration and terminated. The behavior of the code seemed completely irrational. To add to the confusion, the error dataframe was created by selecting the error records from the parent dataframe with an 'ErDesc' IS NOT NULL check, yet the code was hanging at the stage where the error dataframe was loaded into the database. We initially suspected a database-level issue, but eventually found that, because of lazy evaluation in PySpark, the error dataframe was only actually computed when it was accessed for loading into the database table.
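The lazy-evaluation point can be illustrated with a sketch (the JDBC URL and table name are placeholders; 'df' stands for the validated dataframe from above):

    from pyspark.sql.functions import col

    # Nothing runs here: filter() only builds an execution plan
    error_df = df.filter(col("ErDesc").isNotNull())

    # The whole plan, including the ErDesc check, executes only at this
    # action, which is why the hang surfaced at the database-load stage
    error_df.write.format("jdbc") \
        .option("url", "jdbc:...") \
        .option("dbtable", "error_table") \
        .save()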

Solution: To solve this issue, we defined an integer column 'ErInt' alongside the 'ErDesc' column, as shown below.

.withColumn("ErDesc", lit(""))
.withColumn("ErInt",lit(0))

We set 'ErInt' to 1 whenever an error was found in a record, and continued populating the 'ErDesc' column with the relevant error descriptions. Then, when identifying the error records in the dataframe, we checked 'ErInt == 1' instead of checking 'ErDesc' against NULL or '' or "". With this approach all the iterations executed successfully and the hanging issue was resolved.
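Putting the pieces together, here is a minimal runnable sketch of the flag-based approach (the 'amount' column, the rule, and the sample data are illustrative assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lit, when

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, -5), (2, 10)], ["id", "amount"])
    df = df.withColumn("ErDesc", lit("")).withColumn("ErInt", lit(0))

    # On a failed check, set the integer flag and record the description
    failed = col("amount") < 0
    df = df.withColumn("ErInt", when(failed, lit(1)).otherwise(col("ErInt"))) \
           .withColumn("ErDesc",
                       when(failed, lit("amount must be non-negative"))
                       .otherwise(col("ErDesc")))

    # Identify records via the integer flag instead of a string NULL check
    error_df = df.filter(col("ErInt") == 1)
    valid_df = df.filter(col("ErInt") == 0)
    error_df.show()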

Conclusion: In summary, if your PySpark code checks a string field for NULL (or an empty string) and hangs and terminates, switch to checking an integer flag instead, if possible. This resolved the issue for us.
