I am getting a data stream of the form:
+--+---------+---+----+
|id|timestamp|val|xxx |
+--+---------+---+----+
|1 |12:15:25 | 50| 1 |
|2 |12:15:25 | 30| 1 |
|3 |12:15:26 | 30| 2 |
|4 |12:15:27 | 50| 2 |
|5 |12:15:27 | 30| 3 |
|6 |12:15:27 | 60| 4 |
|7 |12:15:28 | 50| 5 |
|8 |12:15:30 | 60| 5 |
|9 |12:15:31 | 30| 6 |
|. |... |...|... |
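For completeness, lines used below is a streaming DataFrame parsed into this schema. The exact source is not important for the question; a minimal sketch, assuming a socket source with comma-separated records, might look like:

import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax

// Hypothetical source: any streaming source parsed into (id, timestamp, val, xxx)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .select(split($"value", ",").as("f"))
  .select(
    $"f"(0).cast("int").as("id"),
    to_timestamp($"f"(1), "HH:mm:ss").as("timestamp"),
    $"f"(2).cast("int").as("val"),
    $"f"(3).cast("int").as("xxx"))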
I would like to apply a window operation to the xxx column, analogous to the time-based window operation that Spark Structured Streaming provides over timestamp, with a given window size and sliding step.
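For comparison, this is roughly how the built-in time-based version over timestamp looks (the durations here are just illustrative):

import org.apache.spark.sql.functions._

// Built-in time-based sliding window: size 2 seconds, step 1 second
val t_windowed_count = lines
  .groupBy(window($"timestamp", "2 seconds", "1 second"), $"val")
  .count()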
In the groupBy with the window function below, I want a window size of 2 and a sliding step of 1 over xxx:
val c_windowed_count = lines
  .groupBy(window($"xxx", "2", "1"), $"val")  // desired, but window() only accepts a timestamp column
  .count()
  .orderBy("window")
So, the output should be as follows:
+------+---+-----+
|window|val|count|
+------+---+-----+
|[1, 3]|50 | 2 |
|[1, 3]|30 | 2 |
|[2, 4]|30 | 2 |
|[2, 4]|50 | 1 |
|[3, 5]|30 | 1 |
|[3, 5]|60 | 1 |
|[4, 6]|60 | 2 |
|[4, 6]|50 | 1 |
|... |.. | .. |
I also tried using partitionBy with a window specification instead, but non-time-based window functions are not supported on streaming DataFrames. My attempt looked roughly like this:
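import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sliding range over xxx: the current value and the next one (size 2, step 1).
// On a streaming DataFrame this fails with an error like
// "Non-time-based windows are not supported on streaming DataFrames/Datasets".
val w = Window.partitionBy($"val").orderBy($"xxx").rangeBetween(0, 1)
val attempted = lines.withColumn("count", count($"val").over(w))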
I am using Spark 2.3.1 (Structured Streaming).
Thanks!