2

Given today's date, I want to compute the first N upcoming Sundays and put them in a DataFrame. Assuming today's date is 2023-05-02 and N = 3, the output should look as follows:

Date
2023-05-07
2023-05-14
2023-05-21

I tried the following:

import datetime

from pyspark.sql import functions as f
from pyspark.sql.types import DateType

current_date = f.current_date()
day_offset = datetime.timedelta(6 - current_date.weekday())
first_sunday = current_date + day_offset

dates = []
for weeks in range(3):
    days = weeks * 7
    dates.append(first_sunday + datetime.timedelta(days=days))

out = spark.createDataFrame(dates, DateType()).toDF("Date")

This gives me the following error:

TypeError: 'Column' object is not callable

I know I can get today's date using datetime.date.today(), but I specifically want to use f.current_date() because I have to @patch it later to make my tests run on a fixed date, and I can't do that with datetime.date.today().

Thanks in advance!
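For reference, a plain-datetime version of this logic (which the question rules out for production use, since it can't be patched the same way, but which pins down the expected output) might look like the sketch below; `upcoming_sundays` is a hypothetical helper name:

```python
import datetime

def upcoming_sundays(today: datetime.date, n: int) -> list:
    # Python's weekday(): Monday = 0 ... Sunday = 6,
    # so (6 - weekday) days ahead is the coming Sunday
    first_sunday = today + datetime.timedelta(days=6 - today.weekday())
    return [first_sunday + datetime.timedelta(weeks=k) for k in range(n)]

print(upcoming_sundays(datetime.date(2023, 5, 2), 3))
# [datetime.date(2023, 5, 7), datetime.date(2023, 5, 14), datetime.date(2023, 5, 21)]
```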

Derek O
Peter
  • 1
    If you are going to use this `upcoming sunday` value inside a DataFrame, please provide a sample input and output. – arudsekaberne May 02 '23 at 02:52
  • 1
    Have you checked https://stackoverflow.com/questions/8708058/python-given-a-date-and-weekday-find-the-date-of-the-next-occurrence-of-a-given ? – Raid May 02 '23 at 03:34
  • @arudsekaberne I provided it. Could you look at the question and try to answer it if you can? – Peter May 02 '23 at 07:28
  • @Raid yes, but that does not conform to my requirements – Peter May 02 '23 at 07:28
  • 1
    Why not use `expr` and a custom expression? I'll try to give an answer... – Memristor May 04 '23 at 22:59

3 Answers

1

Here's my attempt; you could probably simplify it:

from pyspark.sql.functions import current_date, date_add, expr

# I don't include the pyspark session creation (spark)

days_to_first_sunday = expr(
  "7 - IF(DAYOFWEEK(current_date()) = 1, 0, DAYOFWEEK(current_date()) - 1)"
)
days_to_first_sunday_df = spark.range(1).select(
  days_to_first_sunday.alias("days_to_first_sunday")
)

first_sunday_df = days_to_first_sunday_df.select(
  date_add(current_date(), "days_to_first_sunday").alias("Date")
)

N = 3
upcoming_sundays = [date_add(first_sunday_df.columns[0], 7 * i) for i in range(N)]

sundays_df = first_sunday_df.select(
    *upcoming_sundays
).toDF(
    *[f"Date_{i}" for i in range(N)]
).selectExpr(
    f"stack({N}, 'Date_0', Date_0, 'Date_1', Date_1, 'Date_2', Date_2) as (FieldName, Date)"
).drop("FieldName")

sundays_df.show()

The output is:

+----------+
|      Date|
+----------+
|2023-05-07|
|2023-05-14|
|2023-05-21|
+----------+
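The offset logic inside the `expr` above can be sanity-checked in plain Python; `spark_dayofweek` and `days_to_first_sunday` below are hypothetical stand-ins emulating Spark's `DAYOFWEEK` convention (Sunday = 1 ... Saturday = 7), assuming the answer was run on 2023-05-04:

```python
import datetime

def spark_dayofweek(d: datetime.date) -> int:
    # emulate Spark's DAYOFWEEK: Sunday = 1 ... Saturday = 7
    return d.isoweekday() % 7 + 1

def days_to_first_sunday(d: datetime.date) -> int:
    # mirrors: 7 - IF(DAYOFWEEK(d) = 1, 0, DAYOFWEEK(d) - 1)
    dow = spark_dayofweek(d)
    return 7 - (0 if dow == 1 else dow - 1)

today = datetime.date(2023, 5, 4)  # a Thursday
first_sunday = today + datetime.timedelta(days=days_to_first_sunday(today))
print([first_sunday + datetime.timedelta(days=7 * i) for i in range(3)])
# [datetime.date(2023, 5, 7), datetime.date(2023, 5, 14), datetime.date(2023, 5, 21)]
```

Note that a date that is already a Sunday maps to the following Sunday (offset 7), matching the `IF` branch in the expression.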
Memristor
0

Since f.current_date() creates a Column object, I think it makes sense to start with a pyspark dataframe already in place – we can put [1,2,3] into a column called upcoming_sunday_number, and then use f.current_date in the following way:

from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

# note: using f.dayofweek --> sunday is 1

N = 3

spark.createDataFrame(list(range(1,N+1)), "integer").toDF("upcoming_sunday_number").withColumn(
    "current_date", f.current_date()
).withColumn(
    "number_of_days", (1-f.dayofweek(f.current_date()) + 7*f.col('upcoming_sunday_number')).cast(IntegerType())
).withColumn(
    "upcoming_sunday_date", f.date_add(f.current_date(), f.col("number_of_days"))
).show()

+----------------------+------------+--------------+--------------------+
|upcoming_sunday_number|current_date|number_of_days|upcoming_sunday_date|
+----------------------+------------+--------------+--------------------+
|                     1|  2023-05-04|             3|          2023-05-07|
|                     2|  2023-05-04|            10|          2023-05-14|
|                     3|  2023-05-04|            17|          2023-05-21|
+----------------------+------------+--------------+--------------------+
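The `number_of_days` arithmetic can be verified outside Spark; `spark_dayofweek` is a hypothetical helper emulating `f.dayofweek` (Sunday = 1), using the 2023-05-04 current_date shown in the table:

```python
import datetime

def spark_dayofweek(d: datetime.date) -> int:
    # f.dayofweek convention: Sunday = 1 ... Saturday = 7
    return d.isoweekday() % 7 + 1

today = datetime.date(2023, 5, 4)  # the current_date in the table above
for n in (1, 2, 3):
    number_of_days = 1 - spark_dayofweek(today) + 7 * n
    print(n, number_of_days, today + datetime.timedelta(days=number_of_days))
# 1 3 2023-05-07
# 2 10 2023-05-14
# 3 17 2023-05-21
```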

Or more succinctly:

spark.createDataFrame(list(range(1,N+1)), "integer").toDF("upcoming_sunday_number").withColumn(
    "upcoming_sunday_date", f.date_add(f.current_date(), (1-f.dayofweek(f.current_date()) + 7*f.col('upcoming_sunday_number')).cast(IntegerType()))
)

+----------------------+--------------------+
|upcoming_sunday_number|upcoming_sunday_date|
+----------------------+--------------------+
|                     1|          2023-05-07|
|                     2|          2023-05-14|
|                     3|          2023-05-21|
+----------------------+--------------------+

Note that in my original answer, I was a little careless and made upcoming_sunday_number a string instead of an integer, but this still runs with the correct end result:

root
 |-- upcoming_sunday_number: string (nullable = true)
 |-- current_date: date (nullable = false)
 |-- number_of_days: integer (nullable = true)
 |-- upcoming_sunday_date: date (nullable = true)
Derek O
  • This gives the same error as the other answers, you try to add a column in date_add() where an int is expected? – Peter May 05 '23 at 09:39
  • @Peter it looks like you've already found a suitable answer, but it might be worthwhile to dig into why you're getting an error. `f.date_add(f.current_date(), f.col("number_of_days"))` should work because `f.col("number_of_days")` is an `IntegerType` – I explicitly cast this column type to ensure consistency. That being said, I just realized that `upcoming_sunday_number` should probably be an integer and not a string (but even this runs correctly, and I can't reproduce the error you're getting). What version of Spark are you using, and are you running the exact same code in my answer? – Derek O May 05 '23 at 13:41
-1
from pyspark.sql.functions import current_date, date_add, date_sub, col, dayofweek, lit
from pyspark.sql.types import DateType

today = date_sub(current_date(), 0)
days_until_sunday = 6 - dayofweek(today)
next_sunday = date_add(today, days_until_sunday)

N = 3
dates = []
for i in range(N):
    date = date_add(next_sunday, i * 7)
    dates.append(date)

df = spark.createDataFrame([(date,) for date in dates], ["date"])
df.show()
Deven Ramani
  • This does not work. In date_add, you try to add a column where an int is expected (the days_until_sunday) – Peter May 02 '23 at 10:48