My code snippet:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._ // for the $"column" syntax

// Window spanning the entire partition, meant to be reused for other columns as well
val scoreDateByAccount = Window
  .partitionBy($"partition_column")
  .orderBy($"ordering_column")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

dataFrame
  .withColumn("new_column", lead($"ordering_column", 1).over(scoreDateByAccount))
This gives me an error:
Window Frame specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()) must match the required frame specifiedwindowframe(RowFrame, 1, 1);
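In case it helps, here is a minimal, self-contained setup that reproduces the exception for me (the data is made up for illustration; only the column names match my real schema):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

// Toy data: two partitions, a few ordering values each
val dataFrame = Seq(
  ("a", 1), ("a", 2), ("a", 3),
  ("b", 1), ("b", 2)
).toDF("partition_column", "ordering_column")

val scoreDateByAccount = Window
  .partitionBy($"partition_column")
  .orderBy($"ordering_column")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Throws the "must match the required frame specifiedwindowframe(RowFrame, 1, 1)" error
dataFrame
  .withColumn("new_column", lead($"ordering_column", 1).over(scoreDateByAccount))
  .show()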
If I then remove the rowsBetween portion of my snippet, no error ensues. If I choose any function other than lead / lag, everything seems to work perfectly. I need to specify the window frame because I intend to use that very same window to generate other columns as well (omitted here as irrelevant), and, according to this answer, among others, if I do not specify the row limits, the frame defaults to rowsBetween(Window.unboundedPreceding, Window.currentRow), which is not acceptable for me.
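To make the comparison concrete, this is the variant I mean by removing the frame clause; it runs without complaint (a sketch only, reusing the placeholder columns above, with a name I picked just for this example):

// Same window, but relying on the default frame instead of specifying one
val scoreDateByAccountNoFrame = Window
  .partitionBy($"partition_column")
  .orderBy($"ordering_column")

// No exception here
dataFrame
  .withColumn("new_column", lead($"ordering_column", 1).over(scoreDateByAccountNoFrame))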
I've also found a bug report on Spark's Jira, from some 3 years ago, claiming that the lag function was broken (although the lead function worked properly for the OP), and the error thrown was quite similar to mine; it is marked as Resolved nonetheless.
I also cannot find documentation on what kind of window frames lead / lag can be applied over.
What am I missing here?