
We are migrating a huge codebase from Spark 2.x to Spark 3.x. To make the migration incremental, some configs were set to legacy mode to keep the Spark 2.x behavior. The `add_months` function, however, AFAIK does not have a "legacy" mode. According to the Spark 3 migration docs:

In Spark 3.0, the add_months function does not adjust the resulting date to a last day of month if the original date is a last day of months. For example, select add_months(DATE'2019-02-28', 1) results 2019-03-28. In Spark version 2.4 and below, the resulting date is adjusted when the original date is a last day of months. For example, adding a month to 2019-02-28 results in 2019-03-31.

Spark 2.x, by contrast, adjusts the resulting date to the last day of the month. The obvious solution would be to write a wrapper around it, but I wonder whether Spark 3 has any configuration that restores the Spark 2 behavior of `add_months`.
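To make the difference concrete outside of Spark, here is a pure-Python sketch of the two semantics (plain `datetime` arithmetic, not Spark code; the function names are mine, chosen just for this illustration):

```python
import calendar
from datetime import date


def add_months_spark3(d: date, n: int) -> date:
    # Spark 3 semantics: keep the day of month, clamping only when the
    # target month is too short (e.g. Jan 31 + 1 month -> Feb 28/29).
    month_index = d.month - 1 + n
    year, month = d.year + month_index // 12, month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def add_months_spark2(d: date, n: int) -> date:
    # Spark 2 semantics: additionally snap the result to the month end
    # whenever the input date was a month end.
    result = add_months_spark3(d, n)
    if d.day == calendar.monthrange(d.year, d.month)[1]:
        last = calendar.monthrange(result.year, result.month)[1]
        result = result.replace(day=last)
    return result
```

With these, `add_months_spark3(date(2019, 2, 28), 1)` gives `2019-03-28`, while `add_months_spark2(date(2019, 2, 28), 1)` gives `2019-03-31`, matching the example from the migration docs.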

EDIT:

I ended up implementing a wrapper around `add_months` in Scala for Spark 3.x:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{add_months => add_months_spark, last_day, lit, when}

object functions {
  def add_months(startDate: Column, numMonths: Int): Column = add_months(startDate, lit(numMonths))
  def add_months(startDate: Column, numMonths: Column): Column = {
    val addedMonthsSpark   = add_months_spark(startDate, numMonths)
    val startDateIsLastDay = last_day(startDate) === startDate
    // Reproduce Spark 2.x: snap to the month end when the start date is a month end
    when(startDateIsLastDay, last_day(addedMonthsSpark)).otherwise(addedMonthsSpark)
  }
}
Diego
  • Hi Diego! Welcome :) Could you provide a code example of your current usage of `add_months`? This will help others to provide a solution. – tjheslin1 Oct 16 '21 at 10:56
  • Please provide enough code so others can better understand or reproduce the problem. – Community Oct 16 '21 at 10:57
  • @tjheslin1 Thank you :) The usage of add_months is usually something like `add_months(column_with_start_date, number_of_months_to_add)`. I wanted to know if there is a configuration to get the same behavior as in Spark 2.x while keeping the same API. At the end of the day, I decided to implement a wrapper that simulates the behavior of add_months in Spark 2.x. It is not the ideal solution, but it works. – Diego Oct 26 '21 at 07:03

1 Answer


Here is a Python (PySpark) implementation of the wrapper you mentioned.

from typing import Union

from pyspark.sql import Column
from pyspark.sql import functions as f


def add_months(start_date: Union[str, Column], num_months: int) -> Column:
    if isinstance(start_date, str):
        start_date = f.col(start_date)

    add_months_spark = f.add_months(start_date, num_months)
    start_date_is_last_day = f.last_day(start_date) == start_date

    # Reproduce Spark 2.x: snap to the month end when the start date is a month end
    return f.when(
        start_date_is_last_day,
        f.last_day(add_months_spark)
    ).otherwise(add_months_spark)

Also, it is possible to avoid the `isinstance` check by using `functools.singledispatch` to overload the function on the type of `start_date`.
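A sketch of that `singledispatch` idea, using plain-Python stand-ins for pyspark's `Column` and `f.col` so only the dispatch mechanism is shown (the stand-in classes and the returned strings are placeholders, not Spark code):

```python
from functools import singledispatch


class Column:
    """Stand-in for pyspark.sql.Column, just to demonstrate dispatch."""
    def __init__(self, name: str):
        self.name = name


def col(name: str) -> Column:
    """Stand-in for pyspark.sql.functions.col."""
    return Column(name)


@singledispatch
def add_months(start_date, num_months: int):
    # Fallback for unsupported argument types
    raise TypeError(f"unsupported type: {type(start_date)!r}")


@add_months.register
def _(start_date: Column, num_months: int):
    # Real implementation would build the when/otherwise expression here;
    # a placeholder string keeps the example self-contained.
    return f"add_months({start_date.name}, {num_months})"


@add_months.register
def _(start_date: str, num_months: int):
    # String input: resolve to a Column, then reuse the Column overload
    return add_months(col(start_date), num_months)
```

`singledispatch` picks the implementation from the type of the first positional argument, so `add_months("start", 2)` and `add_months(col("start"), 2)` both end up in the `Column` branch without any explicit `isinstance` check.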

Eugene