2

I am trying to use resample technique of pandas in pyspark but can't come to any conclusion.

+----------------------------+----+
|               date         | val|
+-----+---------------------------+
|2022-03-19T00:00:00.000+0000|14.0|
|2022-03-16T00:00:00.000+0000| 9.5|
|2022-03-13T00:00:00.000+0000|14.0|
|2022-03-10T00:00:00.000+0000| 1.0|
|2022-03-08T00:00:00.000+0000|24.0|
+-----+-------------------+----+--+

I have a dataset like above. However, I would like to have a dataset resampled in pyspark, something like this:

+----------------------------+----+
|               date         | val|
+-----+---------------------------+
|2022-03-19T00:00:00.000+0000|14.0|
|2022-03-18T00:00:00.000+0000|14.0|
|2022-03-17T00:00:00.000+0000|14.0|
|2022-03-16T00:00:00.000+0000| 9.5|
|2022-03-15T00:00:00.000+0000| 9.5|
|2022-03-14T00:00:00.000+0000| 9.5|
|2022-03-13T00:00:00.000+0000|14.0|
|2022-03-12T00:00:00.000+0000|14.0|
|2022-03-11T00:00:00.000+0000|14.0|
|2022-03-10T00:00:00.000+0000| 1.0|
|2022-03-09T00:00:00.000+0000| 1.0|
|2022-03-08T00:00:00.000+0000|24.0|
+-----+-------------------+----+--+

Goal is to have the dates in sequence and filled with the missing dates.

  • New val column values should be filled with values of the previous column that already exists.
sargupta
  • 953
  • 13
  • 25

1 Answers1

0

You can first get the previous day available for every row and add 1 day to have the next day from previous day.

Then generate a sequence and explode:

from pyspark.sql import functions as F, Window as W
w = W.orderBy(F.desc("date")) #add .partitionBy(partitioncolumn)
out = df.withColumn("prev_",F.date_add(F.lead("date").over(w),1))\
.withColumn("NewDate",
            F.explode_outer(F.expr("sequence(date,prev_)"))
           )\
.withColumn("date",F.coalesce("NewDate","date")).select(*df.columns)

out.show(truncate=False)

+-------------------+----+
|date               |val |
+-------------------+----+
|2022-03-19 05:30:00|14.0|
|2022-03-18 05:30:00|14.0|
|2022-03-17 05:30:00|14.0|
|2022-03-16 05:30:00|9.5 |
|2022-03-15 05:30:00|9.5 |
|2022-03-14 05:30:00|9.5 |
|2022-03-13 05:30:00|14.0|
|2022-03-12 05:30:00|14.0|
|2022-03-11 05:30:00|14.0|
|2022-03-10 05:30:00|1.0 |
|2022-03-09 05:30:00|1.0 |
|2022-03-08 05:30:00|24.0|
+-------------------+----+
anky
  • 74,114
  • 11
  • 41
  • 70
  • combination of 2022-03-19 00:00:00 & 2022-03-18 00:00:00 dates along with the other row values getting repeated rather than resulting different dates. – sargupta Mar 28 '22 at 15:19
  • @sargupta idea is take the day next to the available previous date (using lag) so as to avoid overlapping of the dates. I tested with replacing `2022-03-16T00:00:00.000+0000` with `2022-03-18T00:00:00.000+0000` in my system and it works. Make sure the datatypes are dates etc maybe – anky Mar 28 '22 at 15:36