I'm new to Spark and I'm using Spark Structured Streaming in Scala to read a stream of data from Kafka.
I want to aggregate the last X hours of data and, if possible, write only the updates to the destination.
For example, say I want the minimum price for customer ID1
over the last 1 hour. If I had the following events:
Events Data:
+-------------------+--------+-----+
|event_time         |customer|price|
+-------------------+--------+-----+
|2021-03-09 11:00:00|ID1     |2000 |
|2021-03-09 11:28:00|ID1     |1500 |
|2021-03-09 15:20:00|ID1     |2500 |
+-------------------+--------+-----+
At 2021-03-09 11:00:00 the desired output (data between 10:00:00 and 11:00:00) is:
+--------+---------+
|customer|min_price|
+--------+---------+
|ID1     |2000     |
+--------+---------+
At 2021-03-09 11:28:00 the desired output (data between 10:28:00 and 11:28:00) is:
+--------+---------+
|customer|min_price|
+--------+---------+
|ID1     |1500     |
+--------+---------+
At 2021-03-09 15:20:00 the desired output (data between 14:20:00 and 15:20:00) is:
+--------+---------+
|customer|min_price|
+--------+---------+
|ID1     |2500     |
+--------+---------+
Instead, the streaming query keeps outputting 1500. I tried filtering the input stream to the last 1 hour, and I also tried a sliding window, but with the sliding window I get too many windows when I only need the last one, the window that ends at the latest event_time.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topics)
  .option("startingOffsets", "latest")
  .load()

val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
// some more transformation ds1 => data

// Attempt 1: filter the input stream to the last hour, then aggregate
val filteredData = data.filter($"event_time" > (current_timestamp() - expr("INTERVAL 1 hour")))
val results = filteredData.groupBy($"customer").agg(min("price").alias("min_price"))

// Attempt 2: sliding window over event_time
val windowedResults = filteredData
  .groupBy(window($"event_time", "1 hour", "5 minutes"), $"customer")
  .agg(min("price").alias("min_price"))
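For reference, the elided "ds1 => data" step just turns the Kafka value into the event_time, customer and price columns. A simplified sketch of that step, assuming (hypothetically) the value is a JSON string with those three fields; my actual parsing is omitted for brevity:

import org.apache.spark.sql.types._

// Hypothetical schema for illustration only; the real payload may differ
val jsonSchema = StructType(Seq(
  StructField("event_time", StringType),
  StructField("customer", StringType),
  StructField("price", IntegerType)
))

val data = ds1
  .select(from_json($"_2", jsonSchema).as("v"))            // _2 is the value cast to STRING
  .select(to_timestamp($"v.event_time").as("event_time"),  // e.g. "2021-03-09 11:00:00"
          $"v.customer", $"v.price")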
For testing purposes, I'm writing the results to the console.
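The console sink looks roughly like this (the output mode here is just what I picked for testing; append would require a watermark for this aggregation):

val query = results.writeStream
  .outputMode("complete")          // or "update"
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()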
Is this feasible in Spark Structured Streaming?