
I'm trying to fill missing timestamps using PySpark in AWS Glue.

My raw data's date column format is like 20220202, and I want to convert it to 2022-02-02, so I used the code below. There are 5 columns:

(1) 'date' is the date column (like 20220202),

(2) 'sku' is categorical data like A, B, C, ... It has 25 different values, and each sku has its own timestamps,

(3) 'unitprice' is numeric data, and each sku has a different unitprice. For example, if sku A has unitprice 30 and sku A has 300 rows in the dataframe, all 300 rows have the same unitprice; sku B has a different unitprice,

(4) 'trand_item' is categorical data. It is metadata of the sku, like color, and follows the same condition as (3): constant within each sku,

(5) 'target' is numeric data, and each row has a different value.

When filling the missing timestamps, I want to fill one row per day and keep the same 'unitprice' and 'trand_item' values for each sku, but fill 0 in 'target' for the newly added rows.

from pyspark.sql.functions import col, to_date
from pyspark.sql.types import StringType

sparkDF = sparkDF.select('date', 'sku', 'unitprice', 'trand_item', 'target')

sparkDF = sparkDF.withColumn("date", sparkDF["date"].cast(StringType()))

# Month must be uppercase "MM"; lowercase "mm" means minutes.
sparkDF = sparkDF.withColumn("date", to_date(col("date"), "yyyyMMdd"))

To repeat: the 'sku' column is categorical with 25 different values like A, B, C, ... Each sku has its own timestamps, and each sku's starting date is different (the ending date is the same).

sparkDF = sparkDF.dropDuplicates(['date', 'sku'])

sparkDF = sparkDF.sort("sku", "date")

Each sku (there are 25 skus in the data) has its own timestamps with gaps, and I want to fill those gaps. How can I handle this?

<sample data>

   date      sku   unitprice    trand_item    target
2018-01-01    A      10            Black        3
2018-02-01    A      10            Black        7
2018-04-01    A      10            Black        13
2017-08-01    B      20            White        4
2017-10-01    B      20            White        17
2017-11-01    B      20            White        9



<output I want>
   date      sku   unitprice    trand_item    target
2018-01-01    A      10            Black        3
2018-02-01    A      10            Black        7
2018-03-01    A      10            Black        0
2018-04-01    A      10            Black        13
2017-08-01    B      20            White        4
2017-09-01    B      20            White        0
2017-10-01    B      20            White        17
2017-11-01    B      20            White        9
daniel lee
  • Can you provide a brief example of some input data and expected output which covers the scenarios you're describing? You talk about "filling missing timestamp", presumably in existing rows with no timestamp data, but also you want to "add new rows for new timestamp" which is a little confusing. – Barnesly Feb 22 '22 at 11:58
  • I added it, @Barnesly – daniel lee Feb 22 '22 at 12:44
  • Can anyone help me solve this problem? – daniel lee Feb 22 '22 at 13:37
  • Looks like the answer provided by @luiz-viola does what you need? – Barnesly Feb 22 '22 at 14:13

1 Answer


Your input:

data = [('2018-01-01','A',10,'Black',3),
('2018-02-01','A',10,'Black',7),
('2018-04-01','A',10,'Black',13),
('2017-08-01','B',20,'White',4),
('2017-10-01','B',20,'White',17),
('2017-11-01','B',20,'White',9)]

cols = ['date', 'sku', 'unitprice', 'trand_item', 'target']
  
df = spark.createDataFrame(data, cols)

Inspired by the great solution from @blackbishop on PySpark generate missing dates and fill data with previous value:

from pyspark.sql import functions as F
from pyspark.sql import Window

# The sample dates are in yyyy-MM-dd form, so parse them with that pattern
# ("yyyy-dd-MM" would swap day and month).
df = df.withColumn("date", F.to_date(F.col("date"), "yyyy-MM-dd"))

# Build a complete calendar per sku, from that sku's min date to its max date.
# The sample data is monthly, so the sequence steps by 1 month.
dates_range = df.groupBy("sku").agg(
    F.min("date").alias("min_date"),
    F.max("date").alias("max_date")
).select(
    "sku",
    F.explode(F.expr("sequence(min_date, max_date, interval 1 month)")).alias("date")
)
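With the sample data, dates_range should now hold one row per sku per month, roughly:

+---+----------+
|sku|      date|
+---+----------+
|  A|2018-01-01|
|  A|2018-02-01|
|  A|2018-03-01|
|  A|2018-04-01|
|  B|2017-08-01|
|  B|2017-09-01|
|  B|2017-10-01|
|  B|2017-11-01|
+---+----------+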

w = Window.partitionBy("sku").orderBy("date")

# Left-join the calendar to the data, forward-fill the per-sku static columns,
# then replace the remaining nulls in target with 0.
result = dates_range \
    .join(df, ["sku", "date"], "left") \
    .select(
        "sku", "date",
        *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
          for c in df.columns if c not in ("sku", "date", "target")],
        "target"
    ) \
    .fillna(0, subset=["target"])
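F.last(..., ignorenulls=True) over the per-sku window ordered by date is a forward fill: each generated row picks up unitprice and trand_item from the most recent real row, while target stays null on generated rows until fillna replaces it with 0.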

result.show()

+---+----------+---------+----------+------+
|sku|      date|unitprice|trand_item|target|
+---+----------+---------+----------+------+
|  A|2018-01-01|       10|     Black|     3|
|  A|2018-02-01|       10|     Black|     7|
|  A|2018-03-01|       10|     Black|     0|
|  A|2018-04-01|       10|     Black|    13|
|  B|2017-08-01|       20|     White|     4|
|  B|2017-09-01|       20|     White|     0|
|  B|2017-10-01|       20|     White|    17|
|  B|2017-11-01|       20|     White|     9|
+---+----------+---------+----------+------+
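If your real data is daily rather than monthly (the question asks to fill per day), only the generated calendar changes; swap the sequence step in dates_range:

# One generated row per day instead of per month.
F.expr("sequence(min_date, max_date, interval 1 day)")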

Luiz Viola