Suppose we have a PySpark dataframe that looks like the one below:
| partition | min_date   | max_date   |
|-----------|------------|------------|
| A         | 01/01/2020 | 01/01/2021 |
| B         | 01/06/2021 | 01/06/2022 |
I want to add a column, say `date_list`, that contains a list of dates between `min_date` and `max_date` (inclusive) for a given (pandas) frequency.
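For reference, the dataframe can be constructed like this (a minimal sketch; I'm assuming the dates are day-first `dd/MM/yyyy` strings that need parsing):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# Build the example dataframe; the date columns start out as strings.
sdf = spark.createDataFrame(
    [("A", "01/01/2020", "01/01/2021"), ("B", "01/06/2021", "01/06/2022")],
    ["partition", "min_date", "max_date"],
)

# Parse the strings into timestamps (day-first format assumed).
sdf = sdf.withColumn("min_date", sf.to_timestamp("min_date", "dd/MM/yyyy"))
sdf = sdf.withColumn("max_date", sf.to_timestamp("max_date", "dd/MM/yyyy"))
```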
After some investigation, I concluded that a UDF would be the best way to achieve this:
import datetime
from typing import List

import pandas as pd
import pyspark.sql.functions as sf
import pyspark.sql.types as st

def create_date_range_column_factory(frequency: str):
    """Returns a UDF that generates a list of dates between min and max with a given frequency."""
    def create_date_range_column(min_date: datetime.datetime, max_date: datetime.datetime) -> List[datetime.datetime]:
        # Using to_pydatetime() as returning a pandas Timestamp gives an error.
        return [pdt.to_pydatetime() for pdt in pd.date_range(min_date, max_date, freq=frequency)]
    return sf.udf(create_date_range_column, st.ArrayType(st.TimestampType()))

# Create a UDF that generates a list with month-end frequency ("M" in pandas)
date_range_udf = create_date_range_column_factory("M")
sdf = sdf.withColumn("date_list", date_range_udf(sf.col("min_date"), sf.col("max_date")))
df = sdf.toPandas()

# Just to get an easy-to-read output
output_list = [str(x) for x in df['date_list'][0]]
`output_list` then looks like:
['2020-01-31 01:00:00', '2020-02-29 01:00:00', '2020-03-31 01:00:00', '2020-04-30 01:00:00', '2020-05-31 01:00:00', '2020-06-30 01:00:00', '2020-07-31 01:00:00', '2020-08-31 01:00:00', '2020-09-30 01:00:00', '2020-10-31 01:00:00', '2020-11-30 01:00:00', '2020-12-31 01:00:00']
Obviously, this is not what I expected: every timestamp is shifted by an hour instead of falling at midnight. I've tried setting `tz=pytz.utc` in `pd.date_range`, but that output puzzles me even more:
['2020-01-31 02:00:00', '2020-02-29 02:00:00', '2020-03-31 03:00:00', '2020-04-30 03:00:00', '2020-05-31 03:00:00', '2020-06-30 03:00:00', '2020-07-31 03:00:00', '2020-08-31 03:00:00', '2020-09-30 03:00:00', '2020-10-31 02:00:00', '2020-11-30 02:00:00', '2020-12-31 02:00:00']
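For completeness, the tz-aware variant only changes the inner function of the factory (a sketch; `frequency` is still the closure variable from the factory, and the other imports are as above):

```python
import pytz

def create_date_range_column(min_date: datetime.datetime, max_date: datetime.datetime) -> List[datetime.datetime]:
    # tz=pytz.utc makes pandas generate timezone-aware (UTC) timestamps.
    return [pdt.to_pydatetime() for pdt in pd.date_range(min_date, max_date, freq=frequency, tz=pytz.utc)]
```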
I've set `spark.sql.session.timeZone` to `UTC`:
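```python
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Diagnostic (sketch, via a private py4j attribute): the JVM's default timezone,
# which differs from the session timezone if the driver machine is not on UTC.
print(spark.sparkContext._jvm.java.util.TimeZone.getDefault().getID())
```

(The JVM-timezone check above is just a diagnostic I added; a non-UTC driver timezone could plausibly explain the one-hour offsets, but that's an assumption.)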
I'm using Spark 3.1.2 and Python 3.9.
What else I tried
- Passing a pandas dataframe with tz-aware timestamps and creating a Spark dataframe from it
- Passing strings from the UDF and converting them to timestamps in Spark (see the sketch below)
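Roughly, the string-based attempt looked like this (a sketch; `create_date_range_string_factory`, `date_list_str`, and the `transform`-based cast are illustrative names, not my exact code):

```python
def create_date_range_string_factory(frequency: str):
    """Variant of the factory that returns formatted strings instead of datetimes."""
    def create_date_range_column(min_date, max_date):
        return [pdt.strftime("%Y-%m-%d %H:%M:%S") for pdt in pd.date_range(min_date, max_date, freq=frequency)]
    return sf.udf(create_date_range_column, st.ArrayType(st.StringType()))

date_range_udf = create_date_range_string_factory("M")
sdf = sdf.withColumn("date_list_str", date_range_udf(sf.col("min_date"), sf.col("max_date")))

# Convert each string element back to a timestamp on the Spark side.
sdf = sdf.withColumn("date_list", sf.expr("transform(date_list_str, x -> to_timestamp(x))"))
```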
Question
Probably something is going wrong when converting the date range generated by the UDF back to PySpark's `TimestampType`, but I can't figure out what I'm doing wrong. Any ideas/input is greatly appreciated.
Obviously, if anyone has a better way of achieving this (without a UDF), that is also welcome.