
I have been scratching my head with a problem in pyspark.

I want to conditionally apply a UDF on a column depending on whether it is NULL or not. One constraint is that I do not have access to the DataFrame at the location where I am writing the code; I only have access to a Column object.

Thus, I cannot simply do:

df.where(my_col.isNull()).select(my_udf(my_col)).toPandas()

Therefore, having only access to a Column object, I was writing the following:

my_res_col = F.when(my_col.isNull(), F.lit(0.0)) \
              .otherwise(my_udf(my_col))

And then later do:

df.select(my_res_col).toPandas()

Unfortunately, for some reason that I do not know, I still receive NULLs in my UDF, forcing me to check for NULL values directly in the UDF.
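For illustration, this is roughly what that workaround looks like; the actual computation in my_udf is of course different, this is just a simplified sketch:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def my_udf(value):
    # Guard against None because NULL rows still reach the UDF
    if value is None:
        return 0.0
    return float(value)  # placeholder for the real logic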

I do not understand why the isNull() is not preventing rows with NULL values from calling the UDF.

Any insight on this matter would be greatly appreciated.

I thank you in advance for your help.

Antoine

1 Answer


I am not sure about your data. Does it contain NaN? Spark handles null and NaN differently: Differences between null and NaN in spark? How to deal with it?

So can you try the below and check if it solves the issue:

import pyspark.sql.functions as F

my_res_col = F.when((my_col.isNull()) | (F.isnan(my_col)), F.lit(0.0)).otherwise(my_udf(my_col))
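To see the difference between null and NaN on a toy example (the DataFrame below is only an assumption for illustration, and spark is the usual SparkSession; replace with your own data):

import pyspark.sql.functions as F

df = spark.createDataFrame([(1.0,), (float("nan"),), (None,)], ["my_col"])
df.select(
    F.col("my_col"),
    F.col("my_col").isNull().alias("is_null"),   # catches only the None row
    F.isnan("my_col").alias("is_nan"),           # catches only the NaN row
).show()

If is_nan comes back true for the rows that surprise you, the combined condition above should keep them out of the UDF as well.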
Raghu