
I am looking to fill in missing dates in a PySpark DataFrame and populate the missing values. Existing PySpark DataFrame:

ID Date Qty
100 2023-02-01 5
100 2023-02-03 3
100 2023-02-04 3
100 2023-02-05 3
100 2023-02-08 3
100 2023-02-09 11
100 2023-02-10 11
100 2023-02-11 10
100 2023-02-13 0

Expected PySpark DataFrame (rows marked with * are the ones to be added):

ID Date Qty
100 2023-02-01 5
100 2023-02-02 3 *add row date and lowest adjacent qty
100 2023-02-03 3
100 2023-02-04 3
100 2023-02-05 3
100 2023-02-06 3 *add row date and lowest adjacent qty
100 2023-02-07 3 *add row date and lowest adjacent qty
100 2023-02-08 3
100 2023-02-09 11
100 2023-02-10 11
100 2023-02-11 10
100 2023-02-12 0 *add row date and lowest adjacent qty
100 2023-02-13 0

I did refer to an existing answer (PySpark generate missing dates and fill data with previous value), but it doesn't fulfill my requirement of filling with the lowest adjacent value.
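
For reference, a previous-value (forward fill) approach along the lines of that answer looks roughly like this (df_full here is a hypothetical name for the data already joined to a complete date calendar); it would fill 2023-02-02 with 5 rather than the desired 3:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# carry the last known Qty forward within each ID, ordered by Date
w = Window.partitionBy('ID').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_ffill = df_full.withColumn('Qty', F.last('Qty', ignorenulls=True).over(w))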

NNM

2 Answers


Based on the comment above, and given that at least one adjacent value is present:

import datetime
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# full calendar of the 13 dates in the example (2023-02-01 .. 2023-02-13)
date_list = [[(datetime.date(2023, 2, 1) + datetime.timedelta(days=x)).strftime("%Y-%m-%d")] for x in range(13)]
df_dt = spark.createDataFrame(date_list, ['date'])

tbl = [
  (100,'2023-02-01',5)
, (100,'2023-02-03',3)
, (100,'2023-02-04',3)
, (100,'2023-02-05',3)
, (100,'2023-02-08',3)
, (100,'2023-02-09',11)
, (100,'2023-02-10',11)
, (100,'2023-02-11',10)
, (100,'2023-02-13',0)
]

df_data = spark.createDataFrame(tbl, ['id','date','qty'])


# left-join the data onto the calendar so missing dates show up as null rows
df_ext = df_dt.join(df_data, on='date', how='left').orderBy('id','date')

# single-id example; with multiple ids this should also partitionBy('id')
window_spec = Window.orderBy('date')

# qty of the previous and next calendar day
df_lead_lag = df_ext.withColumn(
    'lag', F.lag('qty', 1).over(window_spec)
  ).withColumn(
    'lead', F.lead('qty', 1).over(window_spec)
  )

# keep the existing qty; for a missing date take the lower of its two neighbours
# (coalesce 'qty' first, otherwise existing rows next to a gap get overwritten,
#  because least() skips nulls)
df_result = df_lead_lag.withColumn('new_qty', F.coalesce('qty', F.least('lag', 'lead')))

df_result.show()
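
To line the result up with the expected DataFrame, select just the filled column (note the id column is still null on the inserted calendar rows, as the comment below points out):

df_result.select('id', 'date', F.col('new_qty').alias('qty')).orderBy('date').show()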
Chris
  • Note also, depending on the ID field, you may need to fill this in and use a `partitionBy('id')` in your Window spec (see the sketch below these comments). – Chris Feb 27 '23 at 16:46
  • Thanks Chris. This will work for up to 2 missing dates. Thanks for sharing, but I would want a more scalable solution. – NNM Feb 27 '23 at 18:36
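
Per Chris's comment, with multiple IDs the id values missing on the calendar-only rows would need to be filled and the window partitioned by id; a minimal sketch for this single-id example (the literal 100 is specific to the sample data):

df_ext = df_dt.join(df_data, on='date', how='left') \
    .withColumn('id', F.coalesce('id', F.lit(100)))  # fill id on calendar-only rows (single-id example)

window_spec = Window.partitionBy('id').orderBy('date')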

Check out the following solution and let me know what you think:

import pyspark.sql.functions as f
from pyspark.sql.types import *
from pyspark.sql.window import Window
import datetime

tbl = [
  (100,'2023-02-01',5)
, (100,'2023-02-03',3)
, (100,'2023-02-04',3)
, (100,'2023-02-05',3)
, (100,'2023-02-08',3)
, (100,'2023-02-09',11)
, (100,'2023-02-10',11)
, (100,'2023-02-11',10)
, (101,'2023-02-13',0) # I've added one extra id to test the multiple-ids scenario.
]
df = spark.createDataFrame(tbl, ['id','date','qty'])

# min and max dates present in the data (ISO date strings compare correctly)
date_range = (
    df
    .select(
        f.min('date').alias('min_date'),
        f.max('date').alias('max_date')
    )
    .collect()[0]
)
min_date, max_date = datetime.datetime.strptime(date_range.min_date, '%Y-%m-%d'), datetime.datetime.strptime(date_range.max_date, '%Y-%m-%d')

# one row for every day from min_date to max_date inclusive
df_date_range = spark.createDataFrame(
    [((min_date + datetime.timedelta(days = element)).strftime('%Y-%m-%d'),) for element in range((max_date - min_date).days + 1)],
    ['date']
)

# distinct ids, to be cross-joined with the full calendar
df_id = (
    df
    .select('id')
    .distinct()
)

df_output = (
    df_id.crossJoin(df_date_range)                     # one row per (id, date) pair
    .join(df.alias('df'), ['id', 'date'], 'left')      # attach known quantities; missing dates stay null
    .select('id', 'date', 'df.qty')
    # closest non-null qty before the current row
    .withColumn('potential_preceding_qty', f.last(f.col('qty'), ignorenulls = True).over(Window.partitionBy('id').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)))
    # closest non-null qty after the current row
    .withColumn('potential_following_qty', f.first(f.col('qty'), ignorenulls = True).over(Window.partitionBy('id').orderBy('date').rowsBetween(Window.currentRow + 1, Window.unboundedFollowing)))
    # keep existing qty, otherwise take the lower of the two adjacent values
    .withColumn('qty', f.coalesce(f.col('qty'), f.least(f.col('potential_preceding_qty'), f.col('potential_following_qty'))))
    .drop('potential_preceding_qty', 'potential_following_qty')
)
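
A quick way to inspect the result (sketch):

df_output.orderBy('id', 'date').show(30, truncate = False)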
ARCrow