
My code snippet:

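import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

// The $"..." column syntax assumes spark.implicits._ is in scope.
val scoreDateByAccount = Window
  .partitionBy($"partition_column")
  .orderBy($"ordering_column")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

dataFrame
  .withColumn("new_column", lead($"ordering_column", 1).over(scoreDateByAccount))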

This gives me an error:

Window Frame specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()) must match the required frame specifiedwindowframe(RowFrame, 1, 1);

If I remove the rowsBetween portion of my snippet, the error goes away. If I use any function other than lead / lag, everything seems to work perfectly. I need to specify the window frame because I intend to use that very same window to generate other columns as well (omitted here as irrelevant) and, according to this answer, among others, if I do not specify the frame, it will default to rowsBetween(Window.unboundedPreceding, Window.currentRow), which is not acceptable for me.

I've also found a bug report on Spark's Jira, from some 3 years ago, claiming that the lag function was broken (although the lead function worked properly for the OP), and the error thrown was quite similar to mine, but it is marked as Resolved nonetheless.

I also cannot find documentation on what kind of windows the lead / lag can be applied over.

What am I missing here?

Lucas Lima
  • `lead`/`lag` functions should be used when there is a need to look *ahead/behind* in rows respectively. If you have to use the window spec `(Window.unboundedPreceding, Window.unboundedFollowing)` with `lead` you should actually be using a different function, because you are looking at the entire dataset which isn't its intended use case. See the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lead) – Vamsi Prabhala Jan 15 '20 at 16:49
  • Yes, but, as I said, I'm using other functions as well. When you say "not intended", you mean "it will break"? Because, otherwise, that is not on point for me. I just don't want to define 3 different windows. Also, the linked documentation doesn't really say anything concerning what type of window can the function be applied over. I know the function behavior. I just don't know the constraints. – Lucas Lima Jan 15 '20 at 16:52
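
The multi-window arrangement discussed in these comments can be sketched as follows. This is a minimal illustration, not a confirmed fix: it assumes that lead / lag accept a window spec with no explicit frame (as the question reports) and that a framed spec can be derived from the same base spec for the other functions. The local SparkSession, the sample rows, the names baseWindow and fullWindow, and the partition_max column are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lead, max}

// Hypothetical local session and sample data, for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("lead-window-sketch")
  .getOrCreate()
import spark.implicits._

val dataFrame = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 10))
  .toDF("partition_column", "ordering_column")

// Base spec without an explicit frame, so lead / lag can apply their own.
val baseWindow = Window
  .partitionBy($"partition_column")
  .orderBy($"ordering_column")

// Same partitioning and ordering, plus the full-partition frame
// for functions that should see every row of the partition.
val fullWindow = baseWindow
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val result = dataFrame
  .withColumn("new_column", lead($"ordering_column", 1).over(baseWindow))
  .withColumn("partition_max", max($"ordering_column").over(fullWindow))
```

Partitioning and ordering are declared once in baseWindow, so only the frame differs between the two specs.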

0 Answers