20

Running the following code:

val sales = Seq(
  (0, 0, 0, 5),
  (1, 0, 1, 3),
  (2, 0, 2, 1),
  (3, 1, 0, 2),
  (4, 2, 0, 8),
  (5, 2, 2, 8))
  .toDF("id", "orderID", "prodID", "orderQty")

val orderedByID = Window.orderBy('id')

val totalQty = sum('orderQty').over(orderedByID).as('running_total')
val salesTotalQty = sales.select(*, totalQty).orderBy('id')
salesTotalQty.show()

The result is:

+---+-------+------+--------+-------------+
| id|orderID|prodID|orderQty|running_total|
+---+-------+------+--------+-------------+
|  0|      0|     0|       5|            5|
|  1|      0|     1|       3|            8|
|  2|      0|     2|       1|            9|
|  3|      1|     0|       2|           11|
|  4|      2|     0|       8|           19|
|  5|      2|     2|       8|           27|
+---+-------+------+--------+-------------+

There is no window frame defined in the above code, it looks the default window frame is rowsBetween(Window.unboundedPreceding, Window.currentRow)

Not sure my understanding about default window frame is correct

Ankit Agrawal
  • 616
  • 9
  • 20
Tom
  • 5,848
  • 12
  • 44
  • 104

1 Answers1

31

From Spark Gotchas

Default frame specification depends on other aspects of a given window defintion:

  • if the ORDER BY clause is specified and the function accepts the frame specification, then the frame specification is defined by RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
  • otherwise the frame specification is defined by ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
Community
  • 1
  • 1
user8892212
  • 326
  • 3
  • 2
  • Thanks @user8892212. I tried with and without ORDER BY clause, the default window frame is just behaving as you said. – Tom Nov 06 '17 at 05:48
  • 1
    The answer is correct, alright, but, do you have some context of what exactly is Spark Gotchas? I was hoping to find some documentation from either Spark or DataBricks, not a seemingly random git repo describing this behavior. Trying to go briefly through the readme on the said repo, they cut straight to the point, without introducing themselves (from what I've seen). – Lucas Lima Jan 14 '20 at 16:24
  • 2
    Is this true in general? What about `SELECT RANK() OVER(ORDER BY grades) FROM students`. It doesn't find the rank of the student up to and including the current row. It finds the rank of the student out of the whole dataset, which is `BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`. This is contrary to your guideline. I think there are some functions which are exceptions to your rule although it is generally true. – doublefelix Jul 05 '21 at 20:53
  • @doublefelix the following comment was just posted as [an answer](https://stackoverflow.com/a/75244969/5211833): "RANKING window functions do not accept FRAMING." – Adriaan Jan 26 '23 at 10:38