
Suppose you have a table with snapshots of accounts from the past 4 years, with 1 row per date/account pair. (Note that accounts don't need to have a row for every day.)

For simplicity, suppose there are 3 columns: acct_number, date, and balance. How would you add two more columns to each row containing the date and balance for that account's previous recorded date in the table?

Sorting by acct_number and date, then "offsetting by one" and joining back to the original table, seems like a promising approach, but I don't know how to implement it.
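For concreteness, a minimal version of the input might look like this (a sketch with made-up values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._

// Hypothetical snapshot data: one row per (acct_number, date) pair,
// with gaps between recorded dates for each account
val df = Seq(
  ("A1", "2020-01-01", 100.0),
  ("A1", "2020-01-05", 120.0),
  ("A1", "2020-01-09", 95.0),
  ("B2", "2020-01-02", 500.0),
  ("B2", "2020-01-08", 510.0)
).toDF("acct_number", "date", "balance")

For the row ("A1", "2020-01-05", 120.0), the two new columns would hold prev_date = "2020-01-01" and prev_balance = 100.0, and they would be null for each account's first row.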

Kevin
  • Perhaps `Window` aggregation and `lag()` is what you need. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html – mazaneicha Jul 30 '20 at 17:09
  • Does this answer your question? [spark sql window function lag](https://stackoverflow.com/questions/41158115/spark-sql-window-function-lag) – notNull Jul 30 '20 at 17:26
  • Yes, this was the solution. – Kevin Jul 31 '20 at 18:46

1 Answer


The solution is to use Spark's Window functions with a lag of 1.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = load_my_df()

// Window over each account's rows, ordered by date
val frame = Window.partitionBy("acct_number").orderBy("date")

// lag(col, 1, null) pulls the value from the previous row in the window,
// returning null for the first row of each account
val df_new = df
  .withColumn("prev_date", lag("date", 1, null).over(frame))
  .withColumn("prev_balance", lag("balance", 1, null).over(frame))
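Equivalently, the same result can be written in Spark SQL using the LAG window function (a sketch, assuming the DataFrame is registered as a temporary view named snapshots):

df.createOrReplaceTempView("snapshots")

val df_sql = spark.sql("""
  SELECT acct_number, date, balance,
         LAG(date, 1)    OVER (PARTITION BY acct_number ORDER BY date) AS prev_date,
         LAG(balance, 1) OVER (PARTITION BY acct_number ORDER BY date) AS prev_balance
  FROM snapshots
""")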


Great examples here: https://sparkbyexamples.com/spark/spark-sql-window-functions/

Documentation: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Kevin