I want to add a new boolean column to my DataFrame whose value is derived from evaluating two conditions on other columns of the same DataFrame:
# "spark" is an active SparkSession
columns = ["id", "color_one", "color_two"]
data = spark.createDataFrame([(1, "blue", "red"), (2, "red", None)]).toDF(*columns)
data = data.withColumn("is_red", data.color_one.contains("red") | data.color_two.contains("red"))
This works fine unless either color_one or color_two is NULL in a row. In cases like these, is_red is also set to NULL for that row instead of true or false:
+---+---------+---------+------+
| id|color_one|color_two|is_red|
+---+---------+---------+------+
|  1|     blue|      red|  true|
|  2|      red|     NULL|  NULL|
+---+---------+---------+------+
This suggests that PySpark evaluates every clause of the conditional rather than exiting early (via short-circuit evaluation) once the first condition is already true (as in row 2 of my example above).
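To see where the NULL enters, the expression can be split into its pieces (a quick debugging sketch against the same DataFrame; the column aliases are just illustrative):
from pyspark.sql import functions as F

# Evaluate each contains() separately alongside the combined expression;
# contains() on a NULL value yields NULL, and the combined column shows
# how | handles that NULL.
debug = data.select(
    "id",
    data.color_one.contains("red").alias("one_has_red"),
    data.color_two.contains("red").alias("two_has_red"),
    (data.color_one.contains("red") | data.color_two.contains("red")).alias("combined"),
)
debug.show()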
Does PySpark support the short-circuit evaluation of conditional statements?
In the meantime, here is a workaround I have come up with to null-check each column:
from pyspark.sql import functions as F

color_one_is_null = data.color_one.isNull()
color_two_is_null = data.color_two.isNull()

# The both-NULL check has to come first; otherwise the earlier branches
# catch those rows and that case is never reached.
data = data.withColumn(
    "is_red",
    F.when(color_one_is_null & color_two_is_null, F.lit(False))
     .when(color_one_is_null, data.color_two.contains("red"))
     .when(color_two_is_null, data.color_one.contains("red"))
     .otherwise(data.color_one.contains("red") | data.color_two.contains("red")),
)
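A more compact variant of the same idea, assuming a NULL column should simply count as a non-match, is to coalesce each comparison to false before combining them:
from pyspark.sql import functions as F

# Replace a NULL comparison result with false before OR-ing, so a NULL
# in either column can no longer propagate into is_red.
data = data.withColumn(
    "is_red",
    F.coalesce(data.color_one.contains("red"), F.lit(False))
    | F.coalesce(data.color_two.contains("red"), F.lit(False)),
)
This avoids enumerating the NULL combinations by hand, at the cost of always mapping a missing value to false.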