I have a DB that stores pageviews per webpage. It is populated by consuming a Kafka topic named pageviews, where each message has the page name as its key and the number of views since the previous message as its value.

This is a sample of the messages that are expected in pageviews topic:

pageviews topic:

key: "index", value: 349
key: "products", value: 67
key: "index", value: 15
key: "about", value: 11
...

The consumer of pageviews adds each incoming value to the PAGEVIEWS table.
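
For context, a minimal sketch of what that consumer could look like, assuming a PAGEVIEWS(page, views) table, Long-encoded values, and a PostgreSQL-style upsert; the table layout, consumerProps, and jdbcUrl are assumptions, not part of the original setup:

// Hypothetical consumer sketch; table layout, serdes, and upsert syntax are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(consumerProps);
     Connection db = DriverManager.getConnection(jdbcUrl)) {
    consumer.subscribe(Collections.singletonList("pageviews"));
    PreparedStatement upsert = db.prepareStatement(
            "INSERT INTO PAGEVIEWS (page, views) VALUES (?, ?) "
          + "ON CONFLICT (page) DO UPDATE SET views = PAGEVIEWS.views + EXCLUDED.views");
    while (true) {
        for (ConsumerRecord<String, Long> record : consumer.poll(Duration.ofSeconds(1))) {
            upsert.setString(1, record.key());  // page name
            upsert.setLong(2, record.value());  // views since the previous message
            upsert.executeUpdate();
        }
        consumer.commitSync();
    }
}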

Now I am building the producer for the pageviews topic. The data source of this application is the viewstream topic, where one message is created per view, for example:

viewstream topic:

key: "index", value: <timestamp>
key: "index", value: <timestamp>
key: "product", value: <timestamp>
...

In the Kafka Streams application I have the following topology:

PageViewsStreamer:

builder.stream("viewstream")
    .groupByKey()
    .aggregate(...) // this builds a KTable with the sums of views per page
    .toStream()
    .to("pageviews")

I have 2 problems with this topology:

  1. The KTable that holds the aggregations does not get reset/purged after the output message is produced to pageviews, so simply adding the aggregated value to the DB table gives wrong results. How can I ensure that each message sent to pageviews does not include views already counted in previous messages?

  2. I want pageviews messages to be sent once every 15 minutes (the default rate is about every 30 seconds).

I have been trying to use windowing to solve both, but without success so far.

1 Answer

You can achieve this behavior using a 15-minute tumbling window and suppressing the results until the window closes (remember to add a grace period to bound how late an event the previous window will still accept; see the Kafka Streams documentation on suppression for details). I would do something like this:

builder.stream("viewstream")
                .groupByKey()
                //window by a 15-minute time windows, accept event late in 30 second, you can set grace time smaller
                .windowedBy(TimeWindows.of(Duration.ofMinutes(15)).grace(Duration.ofSeconds(30)))
                .aggregate(...) // this builds a KTable with the sums of views per page
                .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
                .toStream()
                //re-select key : from window to key
                .selectKey((key, value) -> key.key())
                .to("pageviews");
  • Thank you Tuyen. I ran an experiment with a 30 + 5 second window, and here is what I observe (after resetting all topics): (a) the producer sends 100 messages within 50 seconds; (b) all messages are received by the streamer, but only 30 of them are aggregated and emitted. The rest, though received by the streamer, are never emitted, no matter how long I wait. – geexee Apr 06 '20 at 07:07
  • Yes, this is expected behavior, because suppress emits messages based on stream time, so the aggregated messages in the last window will not be emitted until new messages come in. I have a solution using the Processor API in this question: https://stackoverflow.com/questions/60822669/kafka-sessionwindow-with-suppress-only-sends-final-event-when-there-is-a-steady – Tuyen Luong Apr 06 '20 at 07:15
  • This should not be a problem if you have a constant flow of incoming messages. – Tuyen Luong Apr 06 '20 at 07:17
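
Building on the Processor API suggestion in the comments, here is a minimal sketch of a wall-clock-based alternative that addresses both original problems without suppress. It is my own illustration, not the code from the linked answer: the store name "view-counts", the class name PageViewsEmitter, and the assumption that viewstream values deserialize as Long are mine. A state store accumulates views per page, and a punctuator emits and resets the counters every 15 minutes of wall-clock time, so each output message only carries views not yet sent.

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

// Hypothetical emitter: counts views per page in a state store and flushes
// the counts on a wall-clock schedule, deleting them after emitting so each
// output message only contains views since the previous message.
public class PageViewsEmitter implements Transformer<String, Long, KeyValue<String, Long>> {
    private ProcessorContext context;
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, Long>) context.getStateStore("view-counts");
        // Every 15 minutes of wall-clock time, emit all counters and reset them.
        context.schedule(Duration.ofMinutes(15), PunctuationType.WALL_CLOCK_TIME, ts -> {
            try (KeyValueIterator<String, Long> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    context.forward(entry.key, entry.value);
                    store.delete(entry.key); // purge so views are not sent twice
                }
            }
            context.commit();
        });
    }

    @Override
    public KeyValue<String, Long> transform(String page, Long timestamp) {
        Long current = store.get(page);
        store.put(page, current == null ? 1L : current + 1L);
        return null; // nothing is emitted per record; the punctuator does the emitting
    }

    @Override
    public void close() {}
}

// Wiring the emitter into the topology:
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("view-counts"),
        Serdes.String(), Serdes.Long()));
builder.stream("viewstream", Consumed.with(Serdes.String(), Serdes.Long()))
        .transform(PageViewsEmitter::new, "view-counts")
        .to("pageviews", Produced.with(Serdes.String(), Serdes.Long()));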