I'd like to outer join several (typically 2-10) Kafka topics by key, ideally using the streaming API. All topics will have the same key and partitions. One way to do this join is to create a KStream for each topic and chain calls to KStream.outerJoin:
stream1
.outerJoin(stream2, ...)
.outerJoin(stream3, ...)
.outerJoin(stream4, ...)
However, the documentation of KStream.outerJoin suggests that each call to outerJoin will materialize its two input streams, so the above example would materialize not just streams 1 to 4 but also stream1.outerJoin(stream2, ...) and stream1.outerJoin(stream2, ...).outerJoin(stream3, ...). That would mean a lot of unnecessary serialization, deserialization, and I/O compared to directly joining the 4 streams.
Another problem with the above approach is that the JoinWindow would not be consistent across all 4 input streams: one JoinWindow would be used to join streams 1 and 2, but then a separate join window would be used to join the resulting stream with stream 3, and so on. For example, if I specify a join window of 10 seconds for each join and entries with a certain key appear in stream 1 at 0 seconds, stream 2 at 6 seconds, stream 3 at 12 seconds, and stream 4 at 18 seconds, the joined item would only be output after 18 seconds, which is an overly long delay. The results also depend on the order of the joins, which seems unnatural.
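To make the timing example above concrete, here is a minimal plain-Java sketch. It does not use the Kafka Streams API. The class ChainedWindowDemo and its methods are hypothetical names, and it assumes that a joined record carries the larger of its two input timestamps, which is my understanding of how the chained joins behave. It only models whether a fully joined row (all four sides present) is produced, not the partial rows an outer join would also emit.

```java
import java.util.OptionalLong;

// Hypothetical model of chained windowed joins on timestamps (milliseconds).
final class ChainedWindowDemo {

    // Joins an intermediate result with the next record if the two timestamps
    // lie within windowMs of each other.
    // Assumption: the joined record carries the larger of the two timestamps.
    static OptionalLong join(OptionalLong left, long right, long windowMs) {
        if (!left.isPresent()) {
            return OptionalLong.empty(); // an earlier join already failed
        }
        long l = left.getAsLong();
        if (Math.abs(l - right) > windowMs) {
            return OptionalLong.empty(); // outside the join window
        }
        return OptionalLong.of(Math.max(l, right));
    }

    // Folds the timestamps left to right, the way chained join calls would.
    static OptionalLong chain(long windowMs, long... timestamps) {
        OptionalLong acc = OptionalLong.of(timestamps[0]);
        for (int i = 1; i < timestamps.length; i++) {
            acc = join(acc, timestamps[i], windowMs);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Order 1, 2, 3, 4: every step is within 10 s of the previous result,
        // so a full row is produced even though streams 1 and 4 are 18 s apart.
        System.out.println(chain(10_000L, 0L, 6_000L, 12_000L, 18_000L).isPresent());

        // Order 1, 4, 2, 3: the first step spans 18 s, so no full row is produced.
        System.out.println(chain(10_000L, 0L, 18_000L, 6_000L, 12_000L).isPresent());
    }
}
```

Under these assumptions the same four records yield a full row in one join order and none in the other, which is the order dependence described above.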
Is there a better approach to multi-way joins using Kafka?