
Suppose you have a table with snapshots of accounts from the past 4 years, with 1 row per date/account pair. (Note that accounts don't need to have a row for every day.)

For simplicity, suppose there are 3 columns: acct_number, date, and balance. How would you add two more columns to each row containing the date and balance for that account's previous recorded date in the table?

Sorting by acct_number and date, then "offsetting by one" and joining back to the original table, seems like a promising approach, but I don't know how to implement it.
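For concreteness, a minimal version of the input might look like this (a sketch with made-up values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._

// Hypothetical snapshot data: one row per (acct_number, date) pair,
// with gaps between recorded dates for each account
val df = Seq(
  ("A1", "2020-01-01", 100.0),
  ("A1", "2020-01-05", 120.0),
  ("A1", "2020-01-09", 95.0),
  ("B2", "2020-01-02", 500.0),
  ("B2", "2020-01-08", 510.0)
).toDF("acct_number", "date", "balance")

For the row ("A1", "2020-01-05", 120.0), the two new columns would hold prev_date = "2020-01-01" and prev_balance = 100.0, and they would be null for each account's first row.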

Kevin
  • Perhaps `Window` aggregation and `lag()` is what you need. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html – mazaneicha Jul 30 '20 at 17:09
  • Does this answer your question? [spark sql window function lag](https://stackoverflow.com/questions/41158115/spark-sql-window-function-lag) – notNull Jul 30 '20 at 17:26
  • Yes, this was the solution. – Kevin Jul 31 '20 at 18:46

1 Answer


The solution is to use Spark's Window functions with a lag of 1.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = load_my_df()

// Window over each account's rows, ordered by date
val frame = Window.partitionBy("acct_number").orderBy("date")

// lag(col, 1, null) pulls the value from the previous row in the window,
// returning null for the first row of each account
val df_new = df
  .withColumn("prev_date", lag("date", 1, null).over(frame))
  .withColumn("prev_balance", lag("balance", 1, null).over(frame))
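Equivalently, the same result can be written in Spark SQL using the LAG window function (a sketch, assuming the DataFrame is registered as a temporary view named snapshots):

df.createOrReplaceTempView("snapshots")

val df_sql = spark.sql("""
  SELECT acct_number, date, balance,
         LAG(date, 1)    OVER (PARTITION BY acct_number ORDER BY date) AS prev_date,
         LAG(balance, 1) OVER (PARTITION BY acct_number ORDER BY date) AS prev_balance
  FROM snapshots
""")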


Great examples here: https://sparkbyexamples.com/spark/spark-sql-window-functions/

Documentation: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Kevin