im trying to fill missing timestamp using pyspark in aws Glue.
My raw data's date cloumns format is like 20220202 I want to convert 20220202 to 2022-02-02. so, i used the code like this. (There are 5 columns.
(1)'date' is date column(like 20220202),
(2)'sku' is categorical data like A,B,C..and it has 25 different values and each sku has their own timestamp,
(3)'unitprice' is numeric data and each sku has different unitprice. Forexample, if sku A has unitprice 30 and sku A has 300 rows in dataframe, 300 rows have same unitprice. However sku B has different unitprice.
(4) 'trand_item' is categorical data. It's kind of metadata of sku like color. It is just categorical data and same condition of (3)
(5) 'target' is numeric data and each row has different value.
When we fill missing timestamp, i want to fill timestamp per day and i want same value of 'unitprice', 'trand_item' for each SKU but want to fill 0 in target when we add new rows for new timestamp.
sparkDF = sparkDF.select('date', 'sku', 'unitprice', 'trand_item', 'target')
sparkDF = sparkDF.withColumn("date",sparkDF["date"].cast(StringType()))
sparkDF = sparkDF.withColumn("date", to_date(col("date"), "yyyymmdd"))
In data, there is 'sku' column. This column is categorical data and it has 25 different values like A,B,C... Each value has their own timestamp and each value's starting date is different.(ending date is same.)
sparkDF = sparkDF.dropDuplicates(['date', 'sku'])
sparkDF = sparkDF.sort("sku", "date")
Each sku(we have 25 sku in data) has their own timestamp and has missing timestamp, so i want to fill it. How can i handle this?
<sample data>
date sku unitprice trand_item target
2018-01-01 A 10 Black 3
2018-02-01 A 10 Black 7
2018-04-01 A 10 Black 13
2017-08-01 B 20 White 4
2017-10-01 B 20 White 17
2017-11-01 B 20 White 9
<output i want>
date sku unitprice trand_item target
2018-01-01 A 10 Black 3
2018-02-01 A 10 Black 7
2018-03-01 A 10 Black 0
2018-04-01 A 10 Black 13
2017-08-01 B 20 White 4
2017-09-01 B 20 White 0
2017-10-01 B 20 White 17
2017-11-01 B 20 White 9