I have the following DataFrame in PySpark:
Id   DateActual           DateStart            DateEnd              SourceCode
107  2019-08-11 00:00:00  null                 null                 1111
107  2019-08-16 00:00:00  2019-08-11 00:00:00  2019-08-18 00:00:00  1111
128  2019-02-11 00:00:00  null                 null                 101
128  2019-02-13 00:00:00  2019-02-11 00:00:00  2019-02-18 00:00:00  168
128  2019-02-14 00:00:00  2019-02-13 00:00:00  2019-02-20 00:00:00  187
I need to substitute the null values in order to get the following result:
Id   DateActual           DateStart            DateEnd              SourceCode
107  2019-08-11 00:00:00  2019-08-11 00:00:00  2019-08-18 00:00:00  1111
107  2019-08-16 00:00:00  2019-08-11 00:00:00  2019-08-18 00:00:00  1111
128  2019-02-11 00:00:00  2019-02-11 00:00:00  2019-02-18 00:00:00  101
128  2019-02-13 00:00:00  2019-02-11 00:00:00  2019-02-18 00:00:00  168
128  2019-02-14 00:00:00  2019-02-13 00:00:00  2019-02-20 00:00:00  187
Basically, null values of DateStart and DateEnd should take the DateStart and DateEnd of the NEXT row, provided that row has the same Id.
How can I fill in the null values following the logic described above in PySpark?
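To pin down the rule before moving to Spark, here is a minimal plain-Python sketch of it on the sample rows. The function name fill_from_next_row is hypothetical, and the sketch assumes "next" means the next row by DateActual within the same Id, and that a null on the last row of an Id stays null.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows: (Id, DateActual, DateStart, DateEnd, SourceCode)
rows = [
    (107, "2019-08-11 00:00:00", None, None, 1111),
    (107, "2019-08-16 00:00:00", "2019-08-11 00:00:00", "2019-08-18 00:00:00", 1111),
    (128, "2019-02-11 00:00:00", None, None, 101),
    (128, "2019-02-13 00:00:00", "2019-02-11 00:00:00", "2019-02-18 00:00:00", 168),
    (128, "2019-02-14 00:00:00", "2019-02-13 00:00:00", "2019-02-20 00:00:00", 187),
]


def fill_from_next_row(rows):
    """Fill null DateStart/DateEnd from the next row (by DateActual) with the same Id."""
    filled = []
    # Sort by Id, then DateActual, so that "next row" is well defined.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        for i, (id_, actual, start, end, code) in enumerate(group):
            # The last row of an Id has no next row; its nulls are left alone.
            nxt = group[i + 1] if i + 1 < len(group) else None
            if start is None and nxt is not None:
                start = nxt[2]
            if end is None and nxt is not None:
                end = nxt[3]
            filled.append((id_, actual, start, end, code))
    return filled


filled = fill_from_next_row(rows)
for row in filled:
    print(row)
```

This only looks one row ahead, so two consecutive null rows for the same Id would leave the first of them unfilled.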
DataFrame:
df = (
    sc.parallelize([
        (107, "2019-08-11 00:00:00", None, None, 1111),
        (107, "2019-08-16 00:00:00", "2019-08-11 00:00:00", "2019-08-18 00:00:00", 1111),
        (128, "2019-02-11 00:00:00", None, None, 101),
        (128, "2019-02-13 00:00:00", "2019-02-11 00:00:00", "2019-02-18 00:00:00", 168),
        (128, "2019-02-14 00:00:00", "2019-02-13 00:00:00", "2019-02-20 00:00:00", 187)
    ]).toDF(["Id", "DateActual", "DateStart", "DateEnd", "SourceCode"])
)
This is what I tried:
from pyspark.sql.functions import col, when
import pyspark.sql.functions as F
from pyspark.sql.window import Window

my_window = Window.partitionBy("Id").orderBy("DateActual")
df.withColumn(
    "DateStart_start",
    when(col("DateStart").isNull(), F.lag(df.DateStart).over(my_window))
    .otherwise(col("DateStart")),
).show()
I do not need a trivial solution such as df.na.fill(0). I need to substitute the null values with the NEXT ROW's values, which probably means using lag or another similar function.