
I am having a hard time comprehending how Windowing works in Kafka Streams. The results don't seem to align with what I have read and understood so far.

I have created a KSQL stream with a backing topic. One of the 'columns' in the KSQL SELECT statement has been designated as the TIMESTAMP for the topic.

CREATE STREAM my_stream WITH (KAFKA_TOPIC='my-stream-topic', VALUE_FORMAT='json', TIMESTAMP='recorded_timestamp') AS SELECT <select list> PARTITION BY PARTITION_KEY;

Records in my-stream-topic are grouped by the key (PARTITION_KEY) and windowed with a hopping window

val dataWindowed: TimeWindowedKStream[String, TopicValue] = builder.stream("my-stream-topic", consumed)
    .groupByKey(Serialized.`with`(Serdes.String(), valueSerde))
    .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)).advanceBy(TimeUnit.MINUTES.toMillis(1)).until(TimeUnit.MINUTES.toMillis(5)))

Records are aggregated via

val dataAgg: KTable[Windowed[String], TopicStats] = dataWindowed
    .aggregate(
      new Initializer[TopicStats] { <code omitted> },
      new Aggregator[String, TopicValue, TopicStats] { <code omitted> },
      Materialized.as[String, TopicStats, WindowStore[Bytes, Array[Byte]]]("time-windowed-aggregated-stream-store")
        .withValueSerde(new JSONSerde[TopicStats])
    )

val topicStats: KStream[String, TopicValueStats] = dataAgg
    .toStream()
    .map( <code omitted for brevity> )

I then print to console via

dataAgg.print()
topicStats.print()

The first window that's in the group translates to 7:00 - 7:05.

When I examine the records in my-stream-topic via a console consumer, I see 2 records that should fall within the above window. However, only 1 of them is picked up by the aggregator.

I thought the dataAgg windowed KTable would contain 1 record for the grouped key, with the aggregate computed from both records. The aggregate value that gets printed is incorrect.

What am I missing?

  • Your understanding sounds correct. Why do you show a KSQL statement though? KSQL is unrelated to running your Kafka Streams code. Thus, one suspicion might be, that your application does not use the correct timestamp. You would need to configure a custom timestamp extractor and set it via `StreamsConfig`. By default, Kafka Streams uses the embedded metadata timestamp. – Matthias J. Sax Jun 01 '18 at 05:06
  • I included KSQL to show that I am designating a column as the timestamp column. The KSQL docs say that that column will be used for window operations, so I was avoiding creating a timestamp extractor. When I print the source stream in the KSQL CLI, I see that the rowtime has the right timestamp value. Can I not rely on the KSQL-generated timestamp? – rams Jun 01 '18 at 09:57
  • You can rely on KSQL to generate the timestamp; however, you need to write a KSQL query... Defining a KSQL stream does not change the timestamps in your input topic "my-stream-topic". What you could do is write the KSQL stream into a second topic, and then consume the second topic with your Kafka Streams application. In this case, the record metadata timestamp in the second topic would be set to the extracted column timestamp and Kafka Streams would pick it up correctly. Does this make sense? – Matthias J. Sax Jun 01 '18 at 15:15
  • @MatthiasJ.Sax my-stream-topic is the topic backing the stream I create in KSQL (my-stream). Isn't KSQL creating my-stream-topic and setting the right timestamp semantics? The KSQL is CREATE STREAM my_stream WITH (KAFKA_topic='my-stream-topic', VALUE_FORMAT='json', TIMESTAMP='recorded_timestamp') AS select P_ID+'-'+L_ID+'-'+U_ID+'-'+R_ID+'-'+B_ID+'-'+V_TYPE as PARTITION_KEY, P_ID, U_ID, R_ID, B_ID, V_DTM as recorded_dtm, V_TYPE, L_ID, (DTM_TIMESTAMP * 1000) as recorded_timestamp, VALUE FROM my_other_stream WHERE D_TYPE = 'NM' PARTITION BY PARTITION_KEY; – rams Jun 01 '18 at 15:21
  • Ok, I have not explained it very well... It is correct that you write to an output topic named `my-stream-topic`; however, the `WITH (TIMESTAMP=...)` clause is only a metadata operation for the created stream `my_stream`. If you use `my_stream` in a second query, the second query will use `recorded_timestamp` as the timestamp by default. However, the records written into `my-stream-topic` inherit the timestamp from the source topic of your `select...` part. – Matthias J. Sax Jun 01 '18 at 18:17
  • I am not sure what your `select...` part computes... If it does not operate on the timestamp, you can set the input timestamp for your `select...` part as `recorded_timestamp` -- this way, the `recorded_timestamp` will be set in the record metadata field when writing to `my-stream-topic`. I agree that the semantics are not very intuitive. I talked to a colleague who works on KSQL and we will raise an issue to fix it. – Matthias J. Sax Jun 01 '18 at 18:20
  • Created: https://github.com/confluentinc/ksql/issues/1367 – Matthias J. Sax Jun 01 '18 at 18:26
  • @MatthiasJ.Sax thank you for the clarifications and for creating the issue. I have 3 topics that I need to join. I use KSQL to perform all my joins and transforms and finally produce a stream with all the data I need to compute stats on. Stats are computed in a Kafka Streams app, and I was using the backing topic of my final stream as the source in that app. If I understand correctly, I need to create yet another stream so that the timestamps are set from my specified column and windowing happens as expected. Is that right? – rams Jun 01 '18 at 18:38
  • Sounds correct. See my answer below. – Matthias J. Sax Jun 01 '18 at 18:53

1 Answer


KSQL can set record timestamps on write; however, you need to specify the timestamp when creating the input stream, not when defining the output stream. I.e., the timestamp specified for the input stream will be used to set the record metadata field on write.

This behavior is rather unintuitive and I opened a ticket for this issue: https://github.com/confluentinc/ksql/issues/1367

Thus, you need to specify the WITH (TIMESTAMP='recorded_timestamp') clause when creating the input stream for the query you showed in the question. If this is not possible because your query needs to operate on a different timestamp, you need to specify a second query that copies the data into a new topic:

CREATE STREAM my_stream_with_ts
    WITH (KAFKA_TOPIC='my-stream-topic-with-ts')
AS SELECT * FROM my_stream PARTITION BY PARTITION_KEY;
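
Your Kafka Streams application would then consume my-stream-topic-with-ts instead of my-stream-topic. Because my_stream was created with TIMESTAMP='recorded_timestamp', the records copied into the new topic carry that column value as their metadata timestamp, which the windowed aggregation picks up by default.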

As an alternative, you can set a custom timestamp extractor for your Kafka Streams application to extract the timestamp from the payload.
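
For example, a minimal sketch of such an extractor, assuming the deserialized value is the TopicValue type from your code and that it exposes recorded_timestamp as epoch milliseconds (the class name and field access below are illustrative assumptions, not taken from your code):

import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.streams.StreamsConfig
import org.apache.kafka.streams.processor.TimestampExtractor

class RecordedTimestampExtractor extends TimestampExtractor {
  // Read the timestamp from the deserialized payload; fall back to the
  // embedded record metadata timestamp if the value is something else.
  override def extract(record: ConsumerRecord[AnyRef, AnyRef], previousTimestamp: Long): Long =
    record.value() match {
      case v: TopicValue => v.recorded_timestamp // assumed epoch-millis field on the payload
      case _             => record.timestamp()
    }
}

// Register the extractor in the StreamsConfig properties so it is used for all source topics
val props = new Properties()
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, classOf[RecordedTimestampExtractor].getName)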

  • I tried to use a custom timestamp extractor but ran into compilation issues https://stackoverflow.com/questions/50646537/scala-kafka-streams-custom-timestamp-extractor-causes-compilation-error – rams Jun 01 '18 at 19:12