
Given a date, I create a column in the ISO 8601 week date format:

from pyspark.sql import functions as F
df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col'])

df = df.withColumn(
    'iso_from_date',
    F.concat_ws(
        '-',
        F.expr('extract(yearofweek from date_col)'),  # ISO week-numbering year
        F.lpad(F.weekofyear('date_col'), 3, 'W0'),    # 'W01'..'W53' (lpad adds the 'W' and zero-pads)
        F.expr('weekday(date_col) + 1')               # ISO day of week, 1 = Monday
    )
)
df.show()
# +----------+-------------+
# |  date_col|iso_from_date|
# +----------+-------------+
# |2019-03-18|   2019-W12-1|
# |2019-12-30|   2020-W01-1|
# |2022-01-03|   2022-W01-1|
# |2022-01-10|   2022-W02-1|
# +----------+-------------+

Using Spark 3, how can I get the date back from the ISO 8601 week date?

I tried the following, but it is both incorrect and relies on the LEGACY configuration, which I would rather avoid.

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2021-12-27|
# |2022-01-10|   2022-W02-1|   2022-01-03|
# +----------+-------------+-------------+
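
To narrow down where the mismatch comes from, here is a quick check (using the same functions as above) of what Spark's week-related fields return for these dates; the values simply restate the iso_from_date column:

df.select(
    'date_col',
    F.expr('extract(yearofweek from date_col)').alias('yearofweek'),
    F.weekofyear('date_col').alias('weekofyear'),
    F.expr('weekday(date_col) + 1').alias('iso_dow')
).show()
# +----------+----------+----------+-------+
# |  date_col|yearofweek|weekofyear|iso_dow|
# +----------+----------+----------+-------+
# |2019-03-18|      2019|        12|      1|
# |2019-12-30|      2020|         1|      1|
# |2022-01-03|      2022|         1|      1|
# |2022-01-10|      2022|         2|      1|
# +----------+----------+----------+-------+

So the extraction side follows ISO week numbering (Monday-based, with week 1 containing January 4); the discrepancy appears only when parsing back with to_date.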

I am aware that I could create a UDF, which works:

import datetime

@F.udf('date')
def iso_to_date(iso_date):
    # %G = ISO week-numbering year, %V = ISO week number, %u = ISO day of week
    return datetime.datetime.strptime(iso_date, '%G-W%V-%u').date()

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()

But I am looking for a more efficient option. Ideally it should not use the LEGACY configuration and should be translatable to SQL or Scala (no inefficient UDF).

ZygD
  • This format will convert the 2019 dates correctly: `YYYY-'W'ww-uu`. However, I think the 2022 week number is not correct according to https://www.timeanddate.com/date/weeknumber.html. Is Spark's weekofyear offset different? – Emma May 06 '22 at 15:09
  • @Emma - thanks for the response. This is a very interesting observation! I really don't know why it doesn't work. As can be seen, Spark perfectly forms the ISO format from the date, so at least in that part it knows how to do the calculations... I will edit my question to use your format, as it seems more appropriate than mine, but it still does not work. – ZygD May 06 '22 at 16:55
  • I also found https://mungingdata.com/apache-spark/week-end-start-dayofweek-next-day/. For the `weekofyear` function: "Week starts with Monday and ends with Sunday (other Spark functions assume week starts with Sunday and ends with Saturday)". I think the behavior you are seeing is related to this. – Emma May 06 '22 at 17:21

1 Answer


In PySpark, I have found an option nicer than a plain UDF: a pandas_udf, which is vectorized (more efficient):

import pandas as pd

@F.pandas_udf('date')
def iso_to_date(iso_date: pd.Series) -> pd.Series:
    # vectorized: pandas parses the whole batch of strings at once
    return pd.to_datetime(iso_date, format='%G-W%V-%u')

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2022-01-03|
# |2022-01-10|   2022-W02-1|   2022-01-10|
# +----------+-------------+-------------+

It works in Spark 3 without the LEGACY configuration. So it's acceptable.

However, there is room for improvement, as this option is not transferable to SQL or Scala.
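
One direction that might close this gap is a sketch using only built-in column functions. It is not exhaustively tested and assumes the ISO rule that week 1 is the week containing January 4, that Spark's weekday() returns 0 for Monday, and that iso_from_date keeps the fixed-width 'yyyy-Www-d' layout, so the parts can be cut out with substring:

df_builtin = (
    df
    # January 4 of the ISO week-numbering year
    .withColumn('jan4', F.to_date(F.concat(F.substring('iso_from_date', 1, 4), F.lit('-01-04'))))
    # Monday of ISO week 1 = January 4 shifted back to its Monday
    .withColumn('week1_monday', F.expr('date_sub(jan4, weekday(jan4))'))
    # add full weeks, then the remaining days within the week
    .withColumn('date_from_iso', F.expr("""
        date_add(week1_monday,
                 (cast(substring(iso_from_date, 7, 2) as int) - 1) * 7
                 + cast(substring(iso_from_date, 10, 1) as int) - 1)"""))
    .drop('jan4', 'week1_monday')
)

Since this uses only built-in column expressions, the same logic should translate directly to SQL or Scala.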

ZygD