I am having a hard time understanding how windowing works in Kafka Streams. The results don't align with what I have read and understood so far.
I have created a KSQL stream with a backing topic. One of the 'columns' in the KSQL SELECT statement has been designated as the TIMESTAMP for the topic.
CREATE STREAM my_stream WITH (KAFKA_topic='my-stream-topic', VALUE_FORMAT='json', TIMESTAMP='recorded_timestamp') AS select <select list> PARTITION BY PARTITION_KEY;
Records in my-stream-topic are grouped by the key (PARTITION_KEY) and windowed with a hopping window (5-minute size, advancing every minute):
val dataWindowed: TimeWindowedKStream[String, TopicValue] = builder.stream("my-stream-topic", consumed)
.groupByKey(Serialized.`with`(Serdes.String(), valueSerde))
.windowedBy(TimeWindows.`of`(TimeUnit.MINUTES.toMillis(5)).advanceBy(TimeUnit.MINUTES.toMillis(1)).until(TimeUnit.MINUTES.toMillis(5)))
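To show how I understand hopping windows, here is a minimal sketch in plain Scala (no Kafka dependency). It assumes windows are aligned to the epoch and that each window covers [start, start + size), which is my reading of how TimeWindows assigns records. The object and method names are my own, purely for illustration.

```scala
object HoppingWindows {
  // Same size and advance as the windowedBy(...) call above.
  val sizeMs: Long = 5 * 60 * 1000L
  val advanceMs: Long = 60 * 1000L

  /** Start timestamps of every window [start, start + sizeMs) that contains `ts`.
    * Assumes window starts are multiples of advanceMs counted from the epoch. */
  def windowStartsFor(ts: Long): Seq[Long] = {
    // Earliest window that can still contain ts, rounded down to an advance boundary.
    val first = math.max(0L, ts - sizeMs + advanceMs) / advanceMs * advanceMs
    Iterator.iterate(first)(_ + advanceMs).takeWhile(_ <= ts).toList
  }
}
```

If this is right, a single record lands in up to five overlapping windows, so I would expect up to five windowed rows per record rather than one.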
Records are aggregated via:
val dataAgg: KTable[Windowed[String], TopicStats] = dataWindowed
.aggregate(
new Initializer[TopicStats] {<code omitted>}},
new Aggregator[String, TopicValue, TopicStats] {<code omitted>}},
Materialized.`as`[String, TopicStats, WindowStore[Bytes, Array[Byte]]]("time-windowed-aggregated-stream-store")
.withValueSerde(new JSONSerde[TopicStats])
)
val topicStats: KStream[String, TopicValueStats] = dataAgg
.toStream()
.map( <code omitted for brevity>)
I then print to the console via:
dataAgg.print()
topicStats.print()
The first window in the group corresponds to 7:00 - 7:05.
When I examine the records in my-stream-topic via a console consumer, I see two records that should fall within that window. However, only one of them is picked up by the aggregator.
I expected the dataAgg windowed KTable to contain one record for the grouped key, with the aggregate computed from both records. Instead, the aggregate value printed is incorrect.
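To make that expectation concrete, here is a small simulation in plain Scala (no Kafka dependency) of what I think a per-window count should look like. The key and the two timestamps (7:01 and 7:03, expressed as milliseconds since midnight) are hypothetical stand-ins for my real records, and the window assignment assumes epoch-aligned windows covering [start, start + size).

```scala
object WindowedCountSketch {
  val sizeMs: Long = 5 * 60 * 1000L
  val advanceMs: Long = 60 * 1000L

  // Helper: hour and minute to milliseconds since midnight (illustrative only).
  def minutes(h: Int, m: Int): Long = (h * 60L + m) * 60 * 1000L

  // Start of every hopping window [start, start + sizeMs) containing ts.
  def windowStartsFor(ts: Long): Seq[Long] = {
    val first = math.max(0L, ts - sizeMs + advanceMs) / advanceMs * advanceMs
    Iterator.iterate(first)(_ + advanceMs).takeWhile(_ <= ts).toList
  }

  // Hypothetical (key, timestamp) records standing in for the two in my topic.
  val records: Seq[(String, Long)] =
    Seq(("key-1", minutes(7, 1)), ("key-1", minutes(7, 3)))

  // (key, windowStart) -> number of records aggregated into that window.
  val counts: Map[(String, Long), Int] =
    records
      .flatMap { case (key, ts) => windowStartsFor(ts).map(start => (key, start)) }
      .groupBy(identity)
      .map { case (keyAndWindow, hits) => keyAndWindow -> hits.size }
}
```

Under these assumptions the 7:00 - 7:05 window should aggregate both records, while windows that contain only one of the two timestamps (for example the one starting at 6:57) should aggregate a single record.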
What am I missing?