
I have the following DataFrame in PySpark:

Id      DateActual          DateStart               DateEnd                 SourceCode
107 2019-08-11 00:00:00     null                    null                    1111
107 2019-08-16 00:00:00     2019-08-11 00:00:00     2019-08-18 00:00:00     1111
128 2019-02-11 00:00:00     null                    null                    101
128 2019-02-13 00:00:00     2019-02-11 00:00:00     2019-02-18 00:00:00     168
128 2019-02-14 00:00:00     2019-02-13 00:00:00     2019-02-20 00:00:00     187

I need to substitute null values in order to get the following result:

Id      DateActual          DateStart               DateEnd                 SourceCode
107 2019-08-11 00:00:00     2019-08-11 00:00:00     2019-08-18 00:00:00     1111
107 2019-08-16 00:00:00     2019-08-11 00:00:00     2019-08-18 00:00:00     1111
128 2019-02-11 00:00:00     2019-02-11 00:00:00     2019-02-18 00:00:00     101
128 2019-02-13 00:00:00     2019-02-11 00:00:00     2019-02-18 00:00:00     168
128 2019-02-14 00:00:00     2019-02-13 00:00:00     2019-02-20 00:00:00     187

Basically, a null DateStart and DateEnd should be replaced with the DateStart and DateEnd of the NEXT row that has the same Id.

How can I fill out the null values following the above-described logic in PySpark?

DataFrame:

df = (
    sc.parallelize([
        (107, "2019-08-11 00:00:00", None, None, 1111),
        (107, "2019-08-16 00:00:00", "2019-08-11 00:00:00", "2019-08-18 00:00:00", 1111),
        (128, "2019-02-11 00:00:00", None, None, 101), 
        (128, "2019-02-13 00:00:00", "2019-02-11 00:00:00", "2019-02-11 00:00:00", 168), 
        (128, "2019-02-14 00:00:00", "2019-02-13 00:00:00", "2019-02-20 00:00:00", 187)
    ]).toDF(["Id", "DateActual", "DateStart", "DateEnd", "SourceCode"])
)

This is what I tried:

from pyspark.sql.functions import col, when 
import pyspark.sql.functions as F
from pyspark.sql.window import Window  

my_window = Window.partitionBy("Id").orderBy("DateActual")

df.withColumn("DateStart_start", when(col("DateStart").isNull(), F.lag(df.DateStart).over(my_window)).otherwise(col("DateStart"))).show()

I do not need a trivial solution such as df.na.fill(0). I need to substitute the null values with NEXT ROW values, which probably requires lag or a similar window function.
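
Something along these lines is the direction I have in mind, though I am not sure it is correct (just a sketch; the lead plus coalesce combination is my guess, untested):

from pyspark.sql.functions import coalesce, lead

# sketch: when the current value is null, take the value from the next row within the same Id
df.withColumn(
    "DateStart", coalesce(col("DateStart"), lead("DateStart").over(my_window))
).withColumn(
    "DateEnd", coalesce(col("DateEnd"), lead("DateEnd").over(my_window))
).show()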

  • I guess they downvoted your question because it already exists on SO. You are supposed to search first to see whether your question has already been asked; if so, it is better to avoid posting a redundant question. Does this post answer your question by any chance? https://stackoverflow.com/questions/42312042/how-to-replace-all-null-values-of-a-dataframe-in-pyspark – alift Jan 10 '20 at 14:03
  • @alift: Please read the question carefully before downvoting. In the post you recommend, `df.na.fill(0)` fills null values with 0. That would be too trivial. I need to follow the logic described in my post. – Fluxy Jan 10 '20 at 14:05

1 Answer


Use first from pyspark.sql.functions:

import sys

from pyspark.sql import Window
from pyspark.sql.functions import first

# define the window: from the current row to the end of the partition
window = Window.partitionBy('Id')\
               .orderBy('DateActual')\
               .rowsBetween(0, sys.maxsize)

# define the back-filled columns: first non-null value from the current row onwards
filled_column_start = first(df['DateStart'], ignorenulls=True).over(window)
filled_column_end = first(df['DateEnd'], ignorenulls=True).over(window)

# do the fill
df_filled = df.withColumn('filled_start', filled_column_start)
df_filled = df_filled.withColumn('filled_end', filled_column_end)

# show off our glorious achievements
df_filled.orderBy('Id').show(10)
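
If you would rather avoid sys.maxsize, the same frame can presumably be expressed with the built-in window bounds (a minor variant; the behaviour should be identical):

# current row to end of partition, without sys.maxsize
window = Window.partitionBy('Id')\
               .orderBy('DateActual')\
               .rowsBetween(Window.currentRow, Window.unboundedFollowing)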
Travis
  • I think the key words of the question are "backfill null values in each partition". You could revise your post title and body to make that clearer, so that your question may help more people. @Fluxy – Travis Jan 10 '20 at 14:54