I am currently running into an issue: I want to apply a window with a sliding interval to my CSV data and, for each window, perform an aggregation to find the most common category. However, I do not have a timestamp column, and I want to slide the window over the index column instead. Can anyone point me in the right direction on how to use windows + sliding intervals on the index?
In short, I want to create windows + sliding intervals over the index column.
Currently I have something like this:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Input schema: the index column plus the category to aggregate
schema = StructType().add("index", "string").add("Category", "integer")

dataframe = spark \
    .readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv("./tmp/input")
# TODO perform Window + sliding interval on dataframe, then perform aggregation per window
aggr = dataframe.groupBy("Category").count().orderBy("count", ascending=False).limit(3)
query = aggr \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
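One idea I have been considering (I am not sure whether it is the right approach) is to cast the index to a timestamp, since pyspark.sql.functions.window() expects a time column, and then use window() with a window duration and slide duration measured in "seconds" that really stand for index units. A rough sketch of what I mean, assuming the index is a plain integer counter; the fake_time column name and the "10 seconds" / "5 seconds" durations are just placeholders:

from pyspark.sql import functions as F

# Interpret the integer index as seconds since the epoch so that window()
# gets the timestamp column it expects (this is the part I am unsure about).
indexed = dataframe.withColumn(
    "fake_time", F.col("index").cast("long").cast("timestamp"))

# Sliding window of 10 index units, sliding every 5 index units,
# counting categories inside each window.
windowed = indexed \
    .groupBy(F.window("fake_time", "10 seconds", "5 seconds"), "Category") \
    .count()

If that is a reasonable direction, I would then rank the counts within each window to get the most common category, but I do not know whether casting the index like this is the intended way to do it, or whether there is a proper way to define windows + sliding intervals directly on a non-timestamp column.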