I have the following DataFrame as input:
+----+-------------------+
| id | date|
+----+-------------------+
| A|2016-03-11 09:00:00|
| B|2016-03-11 09:00:07|
| C|2016-03-11 09:00:18|
| D|2016-03-11 09:00:21|
| E|2016-03-11 09:00:39|
| F|2016-03-11 09:00:44|
| G|2016-03-11 09:00:49|
+----+-------------------+
and I would like to partition it into 10-second windows that restart: a window opens at the first event after the previous window closes, so gaps larger than the window itself are ignored (basically as if the counter restarts at that event).
Here's an example of the expected output (a small plain-Python sketch of the logic follows the table):
+----+-------------------+-----+
| id | date|group|
+----+-------------------+-----+
| A|2016-03-11 09:00:00| 1 |
| B|2016-03-11 09:00:07| 1 |
| C|2016-03-11 09:00:18| 2 |
| D|2016-03-11 09:00:21| 2 |
| E|2016-03-11 09:00:39| 3 |
| F|2016-03-11 09:00:44| 3 |
| G|2016-03-11 09:00:49| 4 |
+----+-------------------+-----+
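To make the rule explicit, here is a minimal plain-Python sketch of the grouping I have in mind, with the timestamps hard-coded as offsets in seconds from 09:00:00 (a new group starts at the first row that falls 10 or more seconds after the current window's start):

offsets = [0, 7, 18, 21, 39, 44, 49]   # seconds after 09:00:00 for A..G
groups, group_no, window_start = [], 0, None
for t in offsets:
    if window_start is None or t - window_start >= 10:
        group_no += 1        # row falls outside the current window: restart it here
        window_start = t
    groups.append(group_no)
print(groups)                # [1, 1, 2, 2, 3, 3, 4]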
As opposed to a fixed time slicing, which would group them like so (again, a short sketch follows the table):
+----+-------------------+-----+
| id | date|group|
+----+-------------------+-----+
| A|2016-03-11 09:00:00| 1 |
| B|2016-03-11 09:00:07| 1 |
| C|2016-03-11 09:00:18| 2 |
| D|2016-03-11 09:00:21| 3 |
| E|2016-03-11 09:00:39| 4 |
| F|2016-03-11 09:00:44| 5 |
| G|2016-03-11 09:00:49| 5 |
+----+-------------------+-----+
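For comparison, fixed slicing simply buckets each timestamp by floor division of its offset by the window length:

offsets = [0, 7, 18, 21, 39, 44, 49]        # seconds after 09:00:00 for A..G
print([t // 10 + 1 for t in offsets])       # [1, 1, 2, 3, 4, 5, 5]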
I tried looking for solutions with the functions rowsBetween and rangeBetween, or with some variations of time differences, but I couldn't find the proper solution.
It's probably due to the fact that I don't know the proper terminology for this kind of windowing (it's neither tumbling nor rolling).
The closest I got was using the window function, but there were two problems:
- It performs a fixed time slicing, which is not what I'm looking for
- I'm not sure how to assign the group ID or how to aggregate it correctly
Here's the code I tried (which uses the original df plus an additional 'val' column holding random values):
from pyspark.sql.functions import window, sum

w = df.groupBy(window("date", "10 seconds")).agg(sum("val").alias("sum"))
w.select(w.window.start.cast("string").alias("start"),
         w.window.end.cast("string").alias("end"), "sum").collect()
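For what it's worth, the only way I've managed to reproduce the "restart" behaviour so far is by iterating over the rows, e.g. with a grouped pandas UDF like the sketch below (assuming Spark >= 3.0; the constant grouping key is just a trick to process all rows together, which obviously won't scale). I'd much rather express this with window functions:

import pandas as pd
from pyspark.sql import functions as F

def assign_groups(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("date").reset_index(drop=True)
    groups, group_no, window_start = [], 0, None
    for ts in pdf["date"]:
        if window_start is None or (ts - window_start).total_seconds() >= 10:
            group_no += 1               # restart the 10-second window at this row
            window_start = ts
        groups.append(group_no)
    return pdf.assign(group=groups)[["id", "date", "group"]]

result = (df.withColumn("g", F.lit(1))
            .groupBy("g")
            .applyInPandas(assign_groups, schema="id string, date timestamp, group long"))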
Any help would be greatly appreciated.