
According to the Apache Kafka docs, KStream-to-KStream joins are always windowed joins. My question is: how can I control the size of the window? Is it the same as the retention time of the data on the topic? Or can we, for example, keep data for 1 month but join only the past week of the stream?

Is there a good example that shows a windowed KStream-to-KStream join?

In my case, let's say I have two KStreams, kstream1 and kstream2. I want to join 10 days of kstream1 with 30 days of kstream2.

Am1rr3zA

2 Answers


That is absolutely possible. When you define your stream operator, you specify the join window size explicitly.

KStream stream1 = ...;
KStream stream2 = ...;
long joinWindowSizeMs = 5L * 60L * 1000L; // 5 minutes
long windowRetentionTimeMs = 30L * 24L * 60L * 60L * 1000L; // 30 days

stream1.leftJoin(stream2,
                 ... // add ValueJoiner
                 JoinWindows.of(joinWindowSizeMs)
);

// or if you want to use retention time

stream1.leftJoin(stream2,
                 ... // add ValueJoiner
                 (JoinWindows)JoinWindows.of(joinWindowSizeMs)
                                         .until(windowRetentionTimeMs)
);

See http://docs.confluent.io/current/streams/developer-guide.html#joining-streams for more details.

The sliding window basically defines an additional join predicate. In SQL-like syntax this would be something like:

SELECT * FROM stream1, stream2
WHERE
   stream1.key = stream2.key
   AND
   stream1.ts - before <= stream2.ts
   AND
   stream2.ts <= stream1.ts + after

where before == after == joinWindowSizeMs in this example. before and after can also have different values if you use JoinWindows#before() and JoinWindows#after() to set them explicitly.
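The predicate above can be sketched in plain Java to make the roles of before and after concrete. This is only a toy model and not Kafka Streams code: the class JoinWindowPredicate and its method matches are hypothetical names introduced for illustration, and timestamps are assumed to be plain millisecond values.

```java
// Toy model of the sliding-window join predicate (hypothetical class, not part of Kafka Streams).
final class JoinWindowPredicate {
    private final long beforeMs; // how far stream2.ts may lie before stream1.ts
    private final long afterMs;  // how far stream2.ts may lie after stream1.ts

    JoinWindowPredicate(long beforeMs, long afterMs) {
        this.beforeMs = beforeMs;
        this.afterMs = afterMs;
    }

    // True when the keys are equal and ts1 - before <= ts2 <= ts1 + after.
    boolean matches(String key1, long ts1, String key2, long ts2) {
        return key1.equals(key2)
            && ts1 - beforeMs <= ts2
            && ts2 <= ts1 + afterMs;
    }

    public static void main(String[] args) {
        long fiveMinutesMs = 5L * 60L * 1000L;

        // Symmetric window: before == after == 5 minutes.
        JoinWindowPredicate symmetric = new JoinWindowPredicate(fiveMinutesMs, fiveMinutesMs);
        System.out.println(symmetric.matches("k", 1_000_000L, "k", 1_000_000L + fiveMinutesMs));     // true
        System.out.println(symmetric.matches("k", 1_000_000L, "k", 1_000_000L + fiveMinutesMs + 1)); // false

        // Asymmetric window: stream2 records may only be later than stream1 records.
        JoinWindowPredicate onlyAfter = new JoinWindowPredicate(0L, fiveMinutesMs);
        System.out.println(onlyAfter.matches("k", 1_000_000L, "k", 999_999L)); // false
    }
}
```

Both bounds are inclusive in this sketch, mirroring the <= comparisons in the SQL-like statement.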

The retention time of the source topics is completely independent of the specified windowRetentionTimeMs, which is applied to a changelog topic created by Kafka Streams itself. Window retention allows out-of-order records to be joined with each other, i.e., records that arrive late (keep in mind that Kafka guarantees offset-based ordering, but with regard to timestamps, records can be out of order).
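To illustrate why retention matters for late records, here is a rough simulation in plain Java. It is a simplified sketch under stated assumptions, not how Kafka Streams implements its window stores: the class RetentionSketch, its methods, and the eviction rule (drop buffered records older than the highest timestamp seen minus the retention time) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Toy model of a join buffer with a retention time (hypothetical, not Kafka Streams internals).
final class RetentionSketch {
    private final long windowMs;
    private final long retentionMs;
    private final List<Long> bufferedLeft = new ArrayList<>(); // timestamps of buffered stream1 records
    private long streamTime = Long.MIN_VALUE;                  // highest timestamp observed so far

    RetentionSketch(long windowMs, long retentionMs) {
        this.windowMs = windowMs;
        this.retentionMs = retentionMs;
    }

    // Advance the observed time and evict buffered records that fell out of retention.
    private void advance(long ts) {
        streamTime = Math.max(streamTime, ts);
        final long cutoff = streamTime - retentionMs;
        bufferedLeft.removeIf(t -> t < cutoff);
    }

    // Buffer a stream1 record, unless it is already older than the retention cutoff.
    void addLeft(long ts) {
        advance(ts);
        if (ts >= streamTime - retentionMs) {
            bufferedLeft.add(ts);
        }
    }

    // Count buffered stream1 records that fall within the join window of a stream2 record.
    int joinRight(long ts) {
        advance(ts);
        int matches = 0;
        for (long left : bufferedLeft) {
            if (Math.abs(left - ts) <= windowMs) {
                matches++;
            }
        }
        return matches;
    }

    // Scenario: a stream2 record with timestamp 1 min arrives after time has advanced to 60 min.
    static int lateMatches(long retentionMs) {
        RetentionSketch sketch = new RetentionSketch(TimeUnit.MINUTES.toMillis(5), retentionMs);
        sketch.addLeft(0L);                              // stream1 record at t = 0
        sketch.addLeft(TimeUnit.MINUTES.toMillis(60));   // later stream1 record advances time
        return sketch.joinRight(TimeUnit.MINUTES.toMillis(1)); // late stream2 record
    }

    public static void main(String[] args) {
        System.out.println(lateMatches(TimeUnit.MINUTES.toMillis(5))); // 0: the t = 0 record was evicted
        System.out.println(lateMatches(TimeUnit.DAYS.toMillis(30)));   // 1: the t = 0 record is still buffered
    }
}
```

In this model, the join window decides which timestamps are allowed to match, while the retention time only decides how long old records remain available for late arrivals.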

Matthias J. Sax
  • Thanks, I will check it and accept your answer once I can run it. I have read most of the examples you mentioned, but I couldn't find any KStream windowed join. – Am1rr3zA Jan 17 '17 at 22:55
  • Also, how can I specify different window sizes? In my case I want to join 10 days of stream-1 with 30 days of stream-2. – Am1rr3zA Jan 17 '17 at 23:00
  • Sorry about the examples. Seems there are only KTable joins... (I thought there was a KStream-KStream join too). Anyway, about "join 10 days of stream-1 with 30 days of stream-2": this is not possible with Kafka Streams, because Kafka Streams only supports Sliding-Window-Joins -- you would need a Hopping-Window-Join. – Matthias J. Sax Jan 18 '17 at 05:40
  • Could you please explain your answer in more detail? What's the difference between joinWindowSizeMs and windowRetentionTimeMs? I cannot use JoinWindows.of(joinWindowSizeMs).until(windowRetentionTimeMs) because the join doesn't accept the output of until(); I can only use JoinWindows.of(joinWindowSizeMs). Besides that, when you specify the timing, does it apply only to the second stream or to both? Based on https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics my understanding is that in a KStream-KStream join the window time applies to both streams. Am I right? – Am1rr3zA Jan 19 '17 at 16:47
  • Just updated my answer. If you use until(), you need to add a cast to make it work (I'll add a fix for this in the upcoming 0.10.2 release). If you don't understand window retention time, just ignore it for now; it's not important. Or please start a new question to keep SO clean. It's bad to mix multiple questions into a single one. And yes, the window size applies to both streams. See the "SQL-like" statement to understand how it works. – Matthias J. Sax Jan 19 '17 at 18:24

In addition to what Matthias J. Sax said, there is a stream-to-stream (windowed) join example at: https://github.com/confluentinc/examples/blob/3.1.x/kafka-streams/src/test/java/io/confluent/examples/streams/StreamToStreamJoinIntegrationTest.java

This is for Confluent 3.1.x with Apache Kafka 0.10.1, i.e. the latest versions as of January 2017. See the master branch in the repository above for code examples that use newer versions.

Here's the key part of the code example above (again, for Kafka 0.10.1), slightly adapted to your question. Note that this example happens to demonstrate an OUTER JOIN.

long joinWindowSizeMs = TimeUnit.MINUTES.toMillis(5);
long windowRetentionTimeMs = TimeUnit.DAYS.toMillis(30);

final Serde<String> stringSerde = Serdes.String();
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> impressions = builder.stream(stringSerde, stringSerde, "adImpressionsTopic");
KStream<String, String> clicks = builder.stream(stringSerde, stringSerde, "adClicksTopic");

KStream<String, String> impressionsAndClicks = impressions.outerJoin(clicks,
    (impressionValue, clickValue) -> impressionValue + "/" + clickValue,
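    // KStream-KStream joins are always windowed joins, hence we must provide a join window.
    // Note: per the comments on the other answer, releases before 0.10.2 need a
    // (JoinWindows) cast on the result of until().
    JoinWindows.of(joinWindowSizeMs).until(windowRetentionTimeMs),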
    stringSerde, stringSerde, stringSerde);

// Write the results to the output topic.
impressionsAndClicks.to(stringSerde, stringSerde, "outputTopic");
miguno