I have a PySpark DataFrame that looks like this:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| 1| 3|
| 2| NaN| 4|
| 3| 3| 5|
+---+----+----+
I would like to sum col1 and col2 so that the result looks like this:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4| 4|
| 3| 3| 5| 8|
+---+----+----+---+
Here's what I have tried:
import pandas as pd
import pyspark.sql.functions as F

test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],
    'col2': [3, 4, 5]
})
# None becomes NaN in the double column when converting to Spark
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()
This code returns:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4|NaN| # <-- I want a 4 here, not this NaN
| 3| 3| 5| 8|
+---+----+----+---+
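For what it's worth, wrapping each column in nanvl so that NaN is replaced by 0 before adding does seem to give the output I want (a minimal sketch, assuming both columns are doubles), but I don't know if this is the idiomatic approach:

import pyspark.sql.functions as F

# nanvl(col, lit(0.0)) keeps the value when it is not NaN and
# substitutes 0.0 when it is, so NaN contributes 0 to the sum
test.withColumn(
    'sum',
    F.nanvl(F.col('col1'), F.lit(0.0)) + F.nanvl(F.col('col2'), F.lit(0.0))
).show()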
Can anyone help me with this?