I have been scratching my head over a problem in PySpark.
I want to conditionally apply a UDF to a column depending on whether it is NULL or not. One constraint is that I do not have access to the DataFrame at the point where I am writing the code; I only have access to a Column object.
Thus, I cannot simply do:
df.where(my_col.isNotNull()).select(my_udf(my_col)).toPandas()
Therefore, having access only to a Column object, I wrote the following:
my_res_col = F.when(my_col.isNull(), F.lit(0.0)) \
    .otherwise(my_udf(my_col))
And then later do:
df.select(my_res_col).toPandas()
Unfortunately, for a reason I do not understand, I still receive NULLs in my UDF, which forces me to check for NULL values directly inside the UDF.
I do not understand why the isNull() check is not preventing rows with NULL values from being passed to the UDF.
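For reference, here is a minimal, self-contained sketch of what I am doing (the DataFrame, the column name, and the UDF body are made up for illustration; my real UDF is more involved):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with one non-null value and one NULL.
df = spark.createDataFrame([(1.0,), (None,)], "my_col double")

@F.udf(returnType=DoubleType())
def my_udf(x):
    # I am forced to guard against None here, even though the
    # when()/otherwise() below should already keep NULL rows away from this UDF.
    if x is None:
        return None
    return float(x) * 2.0

my_col = F.col("my_col")

# Intended behaviour: NULL rows get 0.0, all other rows go through the UDF.
my_res_col = F.when(my_col.isNull(), F.lit(0.0)) \
    .otherwise(my_udf(my_col))

df.select(my_res_col).toPandas()

Without the explicit None check inside my_udf, this still fails for the NULL row, which is how I know the UDF is being called on it.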
Any insight on this matter would be greatly appreciated.
I thank you in advance for your help.
Antoine