
Suppose we have a PySpark dataframe that looks like the one below:

+---------+----------+----------+
|partition|  min_date|  max_date|
+---------+----------+----------+
|        A|01/01/2020|01/01/2021|
|        B|01/06/2021|01/06/2022|
+---------+----------+----------+

I want to add a column, say date_list, that contains a list of dates between the min_date and max_date (inclusive) for a given (pandas) frequency.

After some investigation I decided that a UDF would be the best way to achieve this.

import datetime
from typing import List

import pandas as pd
import pyspark.sql.functions as sf
import pyspark.sql.types as st

def create_date_range_column_factory(frequency: str):
    """Returns a udf that generates a list of dates between min and max with a given frequency"""
    def create_date_range_column(min_date: datetime.datetime, max_date: datetime.datetime) -> List[datetime.datetime]:
        # Using to_pydatetime() as returning a pandas Timestamp gives an error.
        return [pdt.to_pydatetime() for pdt in pd.date_range(min_date, max_date, freq=frequency)]

    return sf.udf(create_date_range_column, st.ArrayType(st.TimestampType()))

# Create a UDF that generates a list with monthly ("M" = month-end) frequency
date_range_udf = create_date_range_column_factory("M")

sdf = sdf.withColumn("date_list", date_range_udf(sf.col("min_date"), sf.col("max_date")))

df = sdf.toPandas()

# Just to get an easy to read output
output_list = [str(x) for x in df['date_list'][0]]

output_list then looks like:

['2020-01-31 01:00:00', '2020-02-29 01:00:00', '2020-03-31 01:00:00', '2020-04-30 01:00:00', '2020-05-31 01:00:00', '2020-06-30 01:00:00', '2020-07-31 01:00:00', '2020-08-31 01:00:00', '2020-09-30 01:00:00', '2020-10-31 01:00:00', '2020-11-30 01:00:00', '2020-12-31 01:00:00']

Obviously, this is not what I expected. I've tried setting tz=pytz.utc in pd.date_range, but that output puzzles me as well:

['2020-01-31 02:00:00', '2020-02-29 02:00:00', '2020-03-31 03:00:00', '2020-04-30 03:00:00', '2020-05-31 03:00:00', '2020-06-30 03:00:00', '2020-07-31 03:00:00', '2020-08-31 03:00:00', '2020-09-30 03:00:00', '2020-10-31 02:00:00', '2020-11-30 02:00:00', '2020-12-31 02:00:00']

I've set spark.sql.session.timeZone to UTC.
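
For completeness, this is roughly how I set and checked it (a sketch, assuming a SparkSession named spark):

# Rough sketch of how I set/verified the session timezone (SparkSession named `spark`)
spark.conf.set("spark.sql.session.timeZone", "UTC")
print(spark.conf.get("spark.sql.session.timeZone"))  # should print: UTC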

I'm using Spark 3.1.2 and Python 3.9.

What else I tried

  • Passing a pandas dataframe with tz-aware timestamps and creating a Spark dataframe from it
  • Passing strings and converting them in Spark (roughly as sketched below)
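
For reference, the string-based attempt looked roughly like this (a sketch from memory, with sf = pyspark.sql.functions as above and hypothetical ISO-formatted inputs):

# Sketch of the string-based attempt: build the frame with ISO date strings
# and let Spark parse them with an explicit format.
sdf = spark.createDataFrame(
    [("A", "2020-01-01", "2021-01-01"), ("B", "2021-06-01", "2022-06-01")],
    ["partition", "min_date", "max_date"],
)
sdf = sdf.withColumn("min_date", sf.to_timestamp("min_date", "yyyy-MM-dd")) \
         .withColumn("max_date", sf.to_timestamp("max_date", "yyyy-MM-dd"))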

Question

Probably something is going wrong when converting the date range generated by the UDF back to PySpark's TimestampType, but I can't figure out what I'm doing wrong. Any ideas/input is greatly appreciated.

Obviously, if anyone has a better way of achieving this (without a UDF), that is also welcome.

Joep Atol
  • why not use [`sequence()`](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.sequence.html) to get the list of dates? see [this example](https://stackoverflow.com/q/72885817/8279585) – samkart Jul 07 '22 at 10:00
  • Great suggestion @samkart thanks! However, it gives the same output (my attempt is sketched below these comments). Maybe my spark settings are not correct? – Joep Atol Jul 07 '22 at 11:00
  • quite odd. what do your `min_date` and `max_date` look like? – samkart Jul 07 '22 at 11:05
  • `min_date` & `max_date` look like (after doing toPandas()): 2020-01-01 00:00:00. Tzinfo on all dates is `None`. I did notice that `min_date` & `max_date` are of type `Timestamp`, while the dates in `output_list` are of type `datetime`. – Joep Atol Jul 07 '22 at 11:17
  • if you're using `sequence()`, check the format in spark and do a `.show()` to check how they look. you can do away with the time part probably -- so, cast it as date. – samkart Jul 07 '22 at 11:22
  • In spark the times are okay, all times are 00:00:00. Very strange.. Casting to date might be a workable solution, will give that a try. – Joep Atol Jul 07 '22 at 11:45
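
For reference, the sequence()-based attempt mentioned in the comments looked roughly like this (a sketch, assuming min_date and max_date are already timestamp columns and sf = pyspark.sql.functions):

# Sketch of the sequence() attempt suggested in the comments: one entry per
# month, stepping from min_date (note this differs from pandas' month-end "M").
sdf = sdf.withColumn(
    "date_list",
    sf.expr("sequence(min_date, max_date, interval 1 month)"),
)
sdf.show(truncate=False)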

1 Answer


Are these dates in d/M/y or M/d/y format? It seems that Spark is not parsing these dates correctly.

Here I suggest another approach that avoids UDFs and pandas, which, as you may know, can lead to OOM errors when working with big data. You can also run this code to check whether your environment returns the expected output. I used a small date range so the results are easier to read.

# example df:

df = spark.createDataFrame(
    [
    ('A','01/01/2020','05/01/2020'),
    ('B','02/06/2021','04/06/2021')
    ],
    ["partition", "min_date", "max_date"]
)

df.show()

+---------+----------+----------+
|partition|  min_date|  max_date|
+---------+----------+----------+
|        A|01/01/2020|05/01/2020|
|        B|02/06/2021|04/06/2021|
+---------+----------+----------+

Here are the steps performed:

1 - transform min_date and max_date to date format

2 - calculate the unix_timestamp difference between these two dates and divide by 86400 to get the delta in days (1 day = 86400 s)

3 - build a string of commas, one for each day, split and explode it, and then compute the date for each position

4 - group by partition, min_date and max_date, then collect the dates of each group into a list

import pyspark.sql.functions as F

df\
        .withColumn('min_date',F.to_date(F.col('min_date'), 'd/M/y'))\
        .withColumn('max_date',F.to_date(F.col('max_date'), 'd/M/y'))\
        .withColumn("timedelta", (F.unix_timestamp('max_date') - F.unix_timestamp('min_date'))/(86400))\
        .withColumn("repeat", F.expr("split(repeat(',', timedelta), ',')"))\
        .select("*", F.posexplode("repeat").alias("days_count", "val"))\
        .withColumn("interval_date_time_exp", (F.unix_timestamp("min_date") + F.col("days_count")*86400).cast('timestamp'))\
        .groupby('partition', 'min_date', 'max_date').agg(F.collect_list('interval_date_time_exp'))\
        .show(truncate = False)

+---------+----------+----------+---------------------------------------------------------------------------------------------------------+
|partition|min_date  |max_date  |collect_list(interval_date_time_exp)                                                                     |
+---------+----------+----------+---------------------------------------------------------------------------------------------------------+
|A        |2020-01-01|2020-01-05|[2020-01-01 00:00:00, 2020-01-02 00:00:00, 2020-01-03 00:00:00, 2020-01-04 00:00:00, 2020-01-05 00:00:00]|
|B        |2021-06-02|2021-06-04|[2021-06-02 00:00:00, 2021-06-03 00:00:00, 2021-06-04 00:00:00]                                          |
+---------+----------+----------+---------------------------------------------------------------------------------------------------------+
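
As discussed in the comments under the question, if the time part is not needed, one possible variation (an untested sketch) is to cast the generated timestamps to dates before collecting them:

# Untested variation: cast each generated timestamp to a date before collecting,
# so the time-of-day component is dropped entirely.
df\
        .withColumn('min_date',F.to_date(F.col('min_date'), 'd/M/y'))\
        .withColumn('max_date',F.to_date(F.col('max_date'), 'd/M/y'))\
        .withColumn("timedelta", (F.unix_timestamp('max_date') - F.unix_timestamp('min_date'))/(86400))\
        .withColumn("repeat", F.expr("split(repeat(',', timedelta), ',')"))\
        .select("*", F.posexplode("repeat").alias("days_count", "val"))\
        .withColumn("interval_date_time_exp", (F.unix_timestamp("min_date") + F.col("days_count")*86400).cast('timestamp'))\
        .groupby('partition', 'min_date', 'max_date')\
        .agg(F.collect_list(F.col('interval_date_time_exp').cast('date')).alias('date_list'))\
        .show(truncate = False)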

Hope this is the outcome you are looking for

Luiz Viola
  • I realised you wanted the delta in months, but given those dates I did not get what you expect. The first day of the month? Can you post the expected output for your example df? – Luiz Viola Jul 07 '22 at 12:09