Suppose I have an input DataFrame with a timestamp column, and I set the window duration (with no sliding interval) to each of the following:

10 minutes

with input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:30:02) to (2019-02-28 22:40:02)

8 minutes

with same input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:26:02) to (2019-02-28 22:34:02)

5 minutes

with same input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:30:02) to (2019-02-28 22:35:02)

14 minutes

with input of time(2019-02-28 22:33:02)
window formed is as (2019-02-28 22:32:02) to (2019-02-28 22:46:02)


So, my question here is :

How does Spark calculate the start time of a window for a given input timestamp?

supernatural

1 Answer

This is explained in the section "Understanding How Intervals are computed" in the "Stream Processing with Apache Spark" book published by O'Reilly:

"The window intervals are aligned to the start of the second/minute/hour/day that corresponds to the next upper time magnitude of the time unit used."

In your case you are always using minutes, so the next upper time magnitude is "hour". The windows are therefore laid out so that one of them ends at the start of the hour. Your cases in more detail (ignore the 2 seconds; this is just a delay in the internals):

  • 10 minutes: 22:40 + 10 + 10 -> start of the hour
  • 8 minutes: 22:34 + 8 + 8 + 8 -> start of the hour
  • 5 minutes: 22:35 + 5 + 5 + ... + 5 -> start of the hour
  • 14 minutes: 22:46 + 14 -> start of the hour
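The additions above can be checked mechanically. Below is a minimal Python sketch that ignores the seconds, as suggested; the helper name `reaches_hour` is made up for illustration and is not part of Spark. It steps from the minute at which each reported window ends, in units of the window length, and reports whether the steps land exactly on the start of the next hour.

```python
def reaches_hour(end_minute: int, length: int) -> bool:
    """Add `length` minutes to `end_minute` until the hour is reached or passed."""
    minute = end_minute
    while minute < 60:
        minute += length
    # True only if the steps land exactly on minute 60 (start of the next hour).
    return minute == 60

# (window length in minutes, minute at which the reported window ends)
cases = [(10, 40), (8, 34), (5, 35), (14, 46)]
results = {length: reaches_hour(end, length) for length, end in cases}
print(results)
```

The 10, 5 and 14 minute cases land on the hour; the 8 minute case stops at 22:58, which is the discrepancy raised in the comments below.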

It is independent of the incoming data and its timestamp/event_time.

As an additional note, the lower window boundary is inclusive whereas the upper one is exclusive. In mathematical notation this would look like [start_time, end_time).
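To make the half-open interval concrete, here is a small Python sketch using the 10 minute window from the question. The function `in_window` is a hypothetical helper for illustration, not a Spark API.

```python
from datetime import datetime

def in_window(ts: datetime, start: datetime, end: datetime) -> bool:
    # Lower bound inclusive, upper bound exclusive: [start, end)
    return start <= ts < end

start = datetime(2019, 2, 28, 22, 30, 2)
end = datetime(2019, 2, 28, 22, 40, 2)

inside = in_window(datetime(2019, 2, 28, 22, 33, 2), start, end)
at_start = in_window(start, start, end)  # a timestamp equal to start belongs to the window
at_end = in_window(end, start, end)      # a timestamp equal to end belongs to the next window
print(inside, at_start, at_end)
```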

Michael Heil
  • Thanks Mike. Had a question on 8 minute window though. 8 minutes: 22:34 + 8 + 8 + 8 -> will not take that to the start of next hour. Shouldn't it have been (2019-02-28 22:28:02) to (2019-02-28 22:36:02) by this logic? – sam22 Jan 27 '21 at 14:12
  • Yes @sam22, you are completely right! How did I calculate 34 + 24 to be 60...? This is confusing indeed, but I still hope the theory helps to get an understanding. Not sure now what is causing this deviation, to be honest. – Michael Heil Jan 27 '21 at 14:36
  • I guess there's more to this windowing logic. While for above dates, I get same windows, for 2022, I get different windows. Got |[2022-02-28 22:26:00, 2022-02-28 22:40:00]| for 14 minutes. – Swapnil May 14 '22 at 12:24
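
One possible explanation for the deviations discussed in these comments, offered as a sketch rather than a confirmed answer: the Spark documentation for `window` describes its `startTime` parameter as an offset with respect to 1970-01-01 00:00:00 UTC, which suggests that tumbling windows are aligned to the Unix epoch rather than to the current hour. The Python below reproduces that arithmetic outside Spark. Two things here are assumptions and not stated anywhere in the thread: the helper name `tumbling_window`, and that the timestamps were displayed in a UTC+05:30 session time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# ASSUMPTION: display time zone of UTC+05:30; the thread does not state one.
DISPLAY_TZ = timezone(timedelta(hours=5, minutes=30))

def tumbling_window(ts: datetime, minutes: int):
    """Return (start, end) of the epoch-aligned window of `minutes` length containing ts."""
    length = timedelta(minutes=minutes)
    # Distance from the epoch modulo the window length = offset into the current window.
    start = ts - (ts - EPOCH) % length
    return start, start + length

ts_2019 = datetime(2019, 2, 28, 22, 33, 2, tzinfo=DISPLAY_TZ)
for minutes in (10, 8, 5, 14):
    start, end = tumbling_window(ts_2019, minutes)
    print(minutes, start.strftime("%H:%M:%S"), end.strftime("%H:%M:%S"))
```

Under these assumptions the start minutes come out as 22:30, 22:26, 22:30 and 22:32, matching the question, including the 8 minute case. The same time of day on 2022-02-28 gives 22:26 to 22:40 for 14 minutes, as in the last comment (whose input timestamp is not given, so that is assumed too). The sketch yields boundaries at :00 seconds, so it does not account for the :02 in the reported windows.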