
I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except for weeks '202001' and '202053'. Example:

import pyspark.sql.functions as F

df = spark.createDataFrame([
(1, "202001"), 
(2, "202002"), 
(3, "202003"), 
(4, "202052"), 
(5, "202053")
], ['id', 'week_year'])

df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()

I can't figure out what the error is or how to fix these weeks. How can I convert weeks 202001 and 202053 to a valid date?

– Ceeeees

3 Answers

Dealing with ISO weeks in Spark is indeed a headache; in fact, this functionality was deprecated (removed?) in Spark 3. I think using Python's datetime utilities inside a UDF is a more flexible way to do this.

import datetime
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # %G = ISO year, %V = ISO week number, %u = ISO weekday;
    # the appended '1' selects Monday, the first day of the ISO week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')

df = spark.createDataFrame([
(1, "202001"), 
(2, "202002"), 
(3, "202003"), 
(4, "202052"), 
(5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year|      date|
+---+---------+----------+
|  1|   202001|2019-12-30|
|  2|   202002|2020-01-06|
|  3|   202003|2020-01-13|
|  4|   202052|2020-12-21|
|  5|   202053|2020-12-28|
+---+---------+----------+
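
For reference, a minimal plain-Python illustration (no Spark involved) of what the strptime directives used above do; the appended '1' selects Monday:

import datetime

# %G = ISO year, %V = ISO week number, %u = ISO weekday (1 = Monday)
d = datetime.datetime.strptime("202001" + "1", "%G%V%u")
print(d.date())         # 2019-12-30 -- the Monday of ISO week 1 of 2020
print(d.isocalendar())  # ISO year 2020, week 1, weekday 1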
– mck
  • Thanks, this helped me on my way! And as this is the preferred way, I'll mark yours as the answer. Unfortunately, we're running Python 3.5.2 on the cluster, so I had to revert to a much uglier solution. I'll add mine in another answer. – Ceeeees Jan 17 '21 at 20:45

Based on mck's answer, this is the solution I ended up using for Python 3.5.2:

import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # %Y%W%w counts weeks from the first Monday of the year ('1' = Monday),
    # which runs one week ahead of ISO numbering here, so subtract a week
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks = 1)

df = spark.createDataFrame([
(9, "201952"), 
(1, "202001"), 
(2, "202002"), 
(3, "202003"), 
(4, "202052"), 
(5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()

Without the '%G%V%u' directives that were added in Python 3.6, I had to subtract a week from the date to get the correct dates.
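
For anyone on Python 3.6+, a quick way to sanity-check this workaround against the ISO directives for the sample weeks above (note that %W counts weeks differently from ISO, so this only verifies the weeks actually checked):

import datetime
from dateutil.relativedelta import relativedelta

samples = ["201952", "202001", "202002", "202003", "202052", "202053"]
for week_year in samples:
    iso_date = datetime.datetime.strptime(week_year + "1", "%G%V%u").date()
    workaround = (datetime.datetime.strptime(week_year + "1", "%Y%W%w")
                  - relativedelta(weeks=1)).date()
    # both approaches agree for these sample weeks
    print(week_year, iso_date, workaround, iso_date == workaround)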

– Ceeeees

The following does not use a plain udf, but instead a more efficient, vectorized pandas_udf:

import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('date')
def week_year_to_date(week_year: pd.Series) -> pd.Series:
    # %G%V%u: ISO year, ISO week, ISO weekday; '1' selects Monday
    return pd.to_datetime(week_year + '1', format='%G%V%u')

df.withColumn('date', week_year_to_date('week_year')).show()
# +---+---------+----------+
# | id|week_year|      date|
# +---+---------+----------+
# |  1|   202001|2019-12-30|
# |  2|   202002|2020-01-06|
# |  3|   202003|2020-01-13|
# |  4|   202052|2020-12-21|
# |  5|   202053|2020-12-28|
# +---+---------+----------+
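
The same conversion can be checked locally with plain pandas (assuming a pandas version whose to_datetime supports the %G/%V/%u ISO directives):

import pandas as pd

week_year = pd.Series(["202001", "202002", "202003", "202052", "202053"])
print(pd.to_datetime(week_year + "1", format="%G%V%u"))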
– ZygD