
I have a simple dataset, as shown below.

| id | name     | country   | languages |
|----|----------|-----------|-----------|
| 1  | Bob      | USA       | Spanish   |
| 2  | Angelina | France    | null      |
| 3  | Carl     | Brazil    | null      |
| 4  | John     | Australia | English   |
| 5  | Anne     | Nepal     | null      |

I am trying to impute the null values in `languages` with the last non-null value, using `pyspark.sql.Window` to create a window over certain rows, but nothing happens. The column that is supposed to have its null values filled, `temp_filled_spark`, remains unchanged, i.e. it is just a copy of the original `languages` column.

import sys

from pyspark.sql import Window
from pyspark.sql.functions import last

window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)

df = df.withColumn('temp_filled_spark', filled_column)

df.orderBy('name', 'country').show(100)

I expect the output column to be:

| temp_filled_spark |
|-------------------|
| Spanish |
| Spanish |
| Spanish |
| English |
| English |

Could anybody help point out the mistake?

user8907896
  • you are partitioning by name column and they are distinct values. so, window will be created for each value. – Suresh Apr 05 '18 at 11:34
  • Thanks @Suresh. What if I don't know if I have distinct or non distinct values in my dataset? Is it possible to impute from last non null value without creating a window? – user8907896 Apr 05 '18 at 11:59

1 Answer


We can create the window treating the entire DataFrame as a single partition:

import sys

from pyspark.sql import Window
from pyspark.sql import functions as F
>>> df1.show()
+---+--------+---------+---------+
| id|    name|  country|languages|
+---+--------+---------+---------+
|  1|     Bob|      USA|  Spanish|
|  2|Angelina|   France|     null|
|  3|    Carl|   Brazil|     null|
|  4|    John|Australia|  English|
|  5|    Anne|    Nepal|     null|
+---+--------+---------+---------+

>>> w = Window.partitionBy(F.lit(1)).orderBy(F.lit(1)).rowsBetween(-sys.maxsize, 0)
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id|    name|  country|languages| newcol|
+---+--------+---------+---------+-------+
|  1|     Bob|      USA|  Spanish|Spanish|
|  2|Angelina|   France|     null|Spanish|
|  3|    Carl|   Brazil|     null|Spanish|
|  4|    John|Australia|  English|English|
|  5|    Anne|    Nepal|     null|English|
+---+--------+---------+---------+-------+

Hope this helps!

Suresh
  • Thanks @Suresh but my new column still remains unchanged. – user8907896 Apr 05 '18 at 13:12
  • Here's my code, @Suresh. Dataset I am using has been taken from here: [link](https://www.kaggle.com/unsdsn/world-happiness/version/2) `window = Window.partitionBy(f.lit(1)).orderBy(f.lit(1)).rowsBetween(-sys.maxsize, 0) df.select("*", f.last('Standard Error', True).over(window).alias('newcol')).show(50)` – user8907896 Apr 05 '18 at 13:27