Resampling datetime by date in pyspark

Question

I am trying to use resample technique of pandas in pyspark but can't come to any conclusion.

+----------------------------+----+
|               date         | val|
+-----+---------------------------+
|2022-03-19T00:00:00.000+0000|14.0|
|2022-03-16T00:00:00.000+0000| 9.5|
|2022-03-13T00:00:00.000+0000|14.0|
|2022-03-10T00:00:00.000+0000| 1.0|
|2022-03-08T00:00:00.000+0000|24.0|
+-----+-------------------+----+--+

I have a dataset like above. However, I would like to have a dataset resampled in pyspark, something like this:

+----------------------------+----+
|               date         | val|
+-----+---------------------------+
|2022-03-19T00:00:00.000+0000|14.0|
|2022-03-18T00:00:00.000+0000|14.0|
|2022-03-17T00:00:00.000+0000|14.0|
|2022-03-16T00:00:00.000+0000| 9.5|
|2022-03-15T00:00:00.000+0000| 9.5|
|2022-03-14T00:00:00.000+0000| 9.5|
|2022-03-13T00:00:00.000+0000|14.0|
|2022-03-12T00:00:00.000+0000|14.0|
|2022-03-11T00:00:00.000+0000|14.0|
|2022-03-10T00:00:00.000+0000| 1.0|
|2022-03-09T00:00:00.000+0000| 1.0|
|2022-03-08T00:00:00.000+0000|24.0|
+-----+-------------------+----+--+

Goal is to have the dates in sequence and filled with the missing dates.

New val column values should be filled with values of the previous column that already exists.

score 0 · Answer 1 · answered Mar 28 '22 at 14:27

You can first get the previous day available for every row and add 1 day to have the next day from previous day.

Then generate a sequence and explode:

from pyspark.sql import functions as F, Window as W
w = W.orderBy(F.desc("date")) #add .partitionBy(partitioncolumn)
out = df.withColumn("prev_",F.date_add(F.lead("date").over(w),1))\
.withColumn("NewDate",
            F.explode_outer(F.expr("sequence(date,prev_)"))
           )\
.withColumn("date",F.coalesce("NewDate","date")).select(*df.columns)

out.show(truncate=False)

+-------------------+----+
|date               |val |
+-------------------+----+
|2022-03-19 05:30:00|14.0|
|2022-03-18 05:30:00|14.0|
|2022-03-17 05:30:00|14.0|
|2022-03-16 05:30:00|9.5 |
|2022-03-15 05:30:00|9.5 |
|2022-03-14 05:30:00|9.5 |
|2022-03-13 05:30:00|14.0|
|2022-03-12 05:30:00|14.0|
|2022-03-11 05:30:00|14.0|
|2022-03-10 05:30:00|1.0 |
|2022-03-09 05:30:00|1.0 |
|2022-03-08 05:30:00|24.0|
+-------------------+----+

combination of 2022-03-19 00:00:00 & 2022-03-18 00:00:00 dates along with the other row values getting repeated rather than resulting different dates. — sargupta, Mar 28 '22 at 15:19
@sargupta idea is take the day next to the available previous date (using lag) so as to avoid overlapping of the dates. I tested with replacing `2022-03-16T00:00:00.000+0000` with `2022-03-18T00:00:00.000+0000` in my system and it works. Make sure the datatypes are dates etc maybe — anky, Mar 28 '22 at 15:36

Resampling datetime by date in pyspark

1 Answers1

Linked