
I am looking to fill in missing dates in a PySpark DataFrame and populate the missing values. Existing PySpark DataFrame:

ID Date Qty
100 2023-02-01 5
100 2023-02-03 3
100 2023-02-04 3
100 2023-02-05 3
100 2023-02-08 3
100 2023-02-09 11
100 2023-02-10 11
100 2023-02-11 10
100 2023-02-13 0

Expected PySpark DataFrame (rows marked with * are the ones to be added):

ID Date Qty
100 2023-02-01 5
100 2023-02-02 3 *add row date and lowest adjacent qty
100 2023-02-03 3
100 2023-02-04 3
100 2023-02-05 3
100 2023-02-06 3 *add row date and lowest adjacent qty
100 2023-02-07 3 *add row date and lowest adjacent qty
100 2023-02-08 3
100 2023-02-09 11
100 2023-02-10 11
100 2023-02-11 10
100 2023-02-12 0 *add row date and lowest adjacent qty
100 2023-02-13 0

I did refer to an existing answer (PySpark generate missing dates and fill data with previous value), but it doesn't fulfill my requirement of filling with the lowest adjacent value.
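
For reference, a previous-value (forward fill) approach along the lines of that answer looks roughly like this (df_full here is a hypothetical name for the data already joined to a complete date calendar); it would fill 2023-02-02 with 5 rather than the desired 3:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# carry the last known Qty forward within each ID, ordered by Date
w = Window.partitionBy('ID').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_ffill = df_full.withColumn('Qty', F.last('Qty', ignorenulls=True).over(w))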

NNM

2 Answers


Based on the comment above, and given that at least one adjacent value is present:

import datetime
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# full calendar of the 13 dates in the example (2023-02-01 .. 2023-02-13)
date_list = [[(datetime.date(2023, 2, 1) + datetime.timedelta(days=x)).strftime("%Y-%m-%d")] for x in range(13)]
df_dt = spark.createDataFrame(date_list, ['date'])

tbl = [
  (100,'2023-02-01',5)
, (100,'2023-02-03',3)
, (100,'2023-02-04',3)
, (100,'2023-02-05',3)
, (100,'2023-02-08',3)
, (100,'2023-02-09',11)
, (100,'2023-02-10',11)
, (100,'2023-02-11',10)
, (100,'2023-02-13',0)
]

df_data = spark.createDataFrame(tbl, ['id','date','qty'])


# left-join the data onto the calendar so missing dates show up as null rows
df_ext = df_dt.join(df_data, on='date', how='left').orderBy('id','date')

# single-id example; with multiple ids this should also partitionBy('id')
window_spec = Window.orderBy('date')

# qty of the previous and next calendar day
df_lead_lag = df_ext.withColumn(
    'lag', F.lag('qty', 1).over(window_spec)
  ).withColumn(
    'lead', F.lead('qty', 1).over(window_spec)
  )

# keep the existing qty; for a missing date take the lower of its two neighbours
# (coalesce 'qty' first, otherwise existing rows next to a gap get overwritten,
#  because least() skips nulls)
df_result = df_lead_lag.withColumn('new_qty', F.coalesce('qty', F.least('lag', 'lead')))

df_result.show()
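
To line the result up with the expected DataFrame, select just the filled column (note the id column is still null on the inserted calendar rows, as the comment below points out):

df_result.select('id', 'date', F.col('new_qty').alias('qty')).orderBy('date').show()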
Chris
  • Note also, depending on the ID field, you may need to fill this in and use a `partitionBy('id')` in your Window spec (see the sketch below these comments). – Chris Feb 27 '23 at 16:46
  • Thanks Chris. This will work for up to 2 missing dates. Thanks for sharing, but I would want a more scalable solution. – NNM Feb 27 '23 at 18:36
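
Per Chris's comment, with multiple IDs the id values missing on the calendar-only rows would need to be filled and the window partitioned by id; a minimal sketch for this single-id example (the literal 100 is specific to the sample data):

df_ext = df_dt.join(df_data, on='date', how='left') \
    .withColumn('id', F.coalesce('id', F.lit(100)))  # fill id on calendar-only rows (single-id example)

window_spec = Window.partitionBy('id').orderBy('date')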

Check out the following solution and let me know what you think:

import pyspark.sql.functions as f
from pyspark.sql.types import *
from pyspark.sql.window import Window
import datetime

tbl = [
  (100,'2023-02-01',5)
, (100,'2023-02-03',3)
, (100,'2023-02-04',3)
, (100,'2023-02-05',3)
, (100,'2023-02-08',3)
, (100,'2023-02-09',11)
, (100,'2023-02-10',11)
, (100,'2023-02-11',10)
, (101,'2023-02-13',0) # I've added one extra id to test the multiple-ids scenario.
]
df = spark.createDataFrame(tbl, ['id','date','qty'])

# min and max dates present in the data (ISO date strings compare correctly)
date_range = (
    df
    .select(
        f.min('date').alias('min_date'),
        f.max('date').alias('max_date')
    )
    .collect()[0]
)
min_date, max_date = datetime.datetime.strptime(date_range.min_date, '%Y-%m-%d'), datetime.datetime.strptime(date_range.max_date, '%Y-%m-%d')

# one row for every day from min_date to max_date inclusive
df_date_range = spark.createDataFrame(
    [((min_date + datetime.timedelta(days = element)).strftime('%Y-%m-%d'),) for element in range((max_date - min_date).days + 1)],
    ['date']
)

# distinct ids, to be cross-joined with the full calendar
df_id = (
    df
    .select('id')
    .distinct()
)

df_output = (
    df_id.crossJoin(df_date_range)                     # one row per (id, date) pair
    .join(df.alias('df'), ['id', 'date'], 'left')      # attach known quantities; missing dates stay null
    .select('id', 'date', 'df.qty')
    # closest non-null qty before the current row
    .withColumn('potential_preceding_qty', f.last(f.col('qty'), ignorenulls = True).over(Window.partitionBy('id').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)))
    # closest non-null qty after the current row
    .withColumn('potential_following_qty', f.first(f.col('qty'), ignorenulls = True).over(Window.partitionBy('id').orderBy('date').rowsBetween(Window.currentRow + 1, Window.unboundedFollowing)))
    # keep existing qty, otherwise take the lower of the two adjacent values
    .withColumn('qty', f.coalesce(f.col('qty'), f.least(f.col('potential_preceding_qty'), f.col('potential_following_qty'))))
    .drop('potential_preceding_qty', 'potential_following_qty')
)
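
A quick way to inspect the result (sketch):

df_output.orderBy('id', 'date').show(30, truncate = False)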
ARCrow