I've got a DataFrame like this:

+-------------------+-----------+-------------+
|             months|       type|summaoborotdt|
+-------------------+-----------+-------------+
|2022-01-01 00:00:00|   schet_21|    131329.55|
|2022-01-01 00:00:00|   schet_22|       7716.1|
|2022-01-01 00:00:00|   schet_23|     23883.65|
|2022-01-01 00:00:00|   schet_24|    131214.84|
|2022-01-01 00:00:00|   schet_25|      5129.21|
|2022-01-01 00:00:00|   schet_26|     15651.74|
|2022-01-01 00:00:00|   schet_27|      1700.01|
|2022-01-01 00:00:00|   schet_28|       3992.0|
|2022-01-01 00:00:00|   schet_29|     16601.33|
|2022-01-01 00:00:00|   schet_30|     27939.84|
+-------------------+-----------+-------------+

How can I resample the DataFrame to daily rows, filling the column with summaoborotdt divided by the number of days in the month?

In Pandas, I could use df.resample('D').ffill(), but there is no equivalent function in PySpark.
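For reference, the intended result can be sketched in pandas itself (illustrative data, not the DataFrame above): forward-fill to daily frequency, then divide by the month length via the DatetimeIndex days_in_month attribute:

```python
import pandas as pd

# monthly values; each should be spread evenly over its month's days
s = pd.Series(
    [28.0, 31.0],
    index=pd.to_datetime(['2022-02-01', '2022-03-01']),
)

daily = s.resample('D').ffill()          # one row per day, value carried forward
daily = daily / daily.index.days_in_month  # split by number of days in that month
```

Here every February row becomes 28.0 / 28 = 1.0, which is the per-day share the question describes.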


1 Answer

In Spark this is more involved, because there is no index on which you could resample. To expand months into days you can do these steps:

  • use the window function lead to find the next row's date
  • create an array with the range of days up to (but not including) that date
  • explode the array

Example input:

from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [('2022-02-01 00:00:00', 'schet_21', 131329.55),
     ('2022-03-01 00:00:00', 'schet_22', 7716.1)],
    ['months', 'type', 'summaoborotdt']
).withColumn('months', F.to_timestamp('months'))

Script:

# day before the next row's date; null for the last row
last_day = F.date_sub(F.lead('months').over(W.orderBy('months')), 1)
df = df.select(
    # sequence of timestamps defaults to a 1-day step;
    # coalesce keeps the last row as a single-element array
    F.sequence('months', F.coalesce(last_day, 'months')).alias('days'),
    *[c for c in df.columns if c != 'months']
).withColumn('days', F.explode('days'))

Result:

df.show(99)
# +-------------------+--------+-------------+
# |               days|    type|summaoborotdt|
# +-------------------+--------+-------------+
# |2022-02-01 00:00:00|schet_21|    131329.55|
# |2022-02-02 00:00:00|schet_21|    131329.55|
# |2022-02-03 00:00:00|schet_21|    131329.55|
# |2022-02-04 00:00:00|schet_21|    131329.55|
# |2022-02-05 00:00:00|schet_21|    131329.55|
# |2022-02-06 00:00:00|schet_21|    131329.55|
# |2022-02-07 00:00:00|schet_21|    131329.55|
# |2022-02-08 00:00:00|schet_21|    131329.55|
# |2022-02-09 00:00:00|schet_21|    131329.55|
# |2022-02-10 00:00:00|schet_21|    131329.55|
# |2022-02-11 00:00:00|schet_21|    131329.55|
# |2022-02-12 00:00:00|schet_21|    131329.55|
# |2022-02-13 00:00:00|schet_21|    131329.55|
# |2022-02-14 00:00:00|schet_21|    131329.55|
# |2022-02-15 00:00:00|schet_21|    131329.55|
# |2022-02-16 00:00:00|schet_21|    131329.55|
# |2022-02-17 00:00:00|schet_21|    131329.55|
# |2022-02-18 00:00:00|schet_21|    131329.55|
# |2022-02-19 00:00:00|schet_21|    131329.55|
# |2022-02-20 00:00:00|schet_21|    131329.55|
# |2022-02-21 00:00:00|schet_21|    131329.55|
# |2022-02-22 00:00:00|schet_21|    131329.55|
# |2022-02-23 00:00:00|schet_21|    131329.55|
# |2022-02-24 00:00:00|schet_21|    131329.55|
# |2022-02-25 00:00:00|schet_21|    131329.55|
# |2022-02-26 00:00:00|schet_21|    131329.55|
# |2022-02-27 00:00:00|schet_21|    131329.55|
# |2022-02-28 00:00:00|schet_21|    131329.55|
# |2022-03-01 00:00:00|schet_22|       7716.1|
# +-------------------+--------+-------------+
ZygD