I have 2 Kafka topics streaming the exact same content from different sources so I can have high availability in case one of the sources fails. I'm attempting to merge the 2 topics into 1 output topic using Kafka Streams 0.10.1.0 such that I don't miss any messages on failures and there are no duplicates when all sources are up.
When using the leftJoin method of KStream, the secondary topic can go down with no problem, but when the primary topic goes down, nothing is sent to the output topic. This seems to be because, according to the Kafka Streams developer guide, "KStream-KStream leftJoin is always driven by records arriving from the primary stream", so if no records arrive from the primary stream, the records from the secondary stream are not used even if they exist. Once the primary stream comes back online, output resumes normally.
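To make the behaviour described above concrete, here is a minimal plain-Java sketch of a left join that is driven only by the primary side. It is a toy model, not the Kafka Streams API: the class LeftJoinSketch, the Rec type, and the leftJoin method are hypothetical names, and join windows are ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a primary-driven left join; NOT the Kafka Streams API.
// Assumption: windowing is ignored and only the latest secondary value per key is buffered.
class LeftJoinSketch {

    // One input record: which topic it came from, plus its key and value.
    static final class Rec {
        final boolean fromPrimary;
        final String key;
        final String value;

        Rec(boolean fromPrimary, String key, String value) {
            this.fromPrimary = fromPrimary;
            this.key = key;
            this.value = value;
        }
    }

    // Processes records in arrival order and returns the emitted join results,
    // formatted as "key:primaryValue+secondaryValue".
    static List<String> leftJoin(List<Rec> arrivals) {
        Map<String, String> secondaryBuffer = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (Rec r : arrivals) {
            if (r.fromPrimary) {
                // Only a primary record triggers output; the secondary side may be null.
                String other = secondaryBuffer.get(r.key);
                out.add(r.key + ":" + r.value + "+" + other);
            } else {
                // A secondary record is only buffered; it never emits on its own.
                secondaryBuffer.put(r.key, r.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> primaryDown = new ArrayList<>();
        primaryDown.add(new Rec(false, "k1", "a"));
        primaryDown.add(new Rec(false, "k2", "b"));
        // With the primary topic silent, the result list stays empty.
        System.out.println(leftJoin(primaryDown));
    }
}
```

In this model, an input made up only of secondary records produces no output at all, which matches the symptom seen when the primary topic goes down.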
I've also tried using outerJoin (which adds duplicate records), followed by groupByKey and a windowed reduce into a KTable to get rid of the duplicates:
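// Outer-join the two streams, preferring stream1's value when both are present.
KStream mergedStream = stream1.outerJoin(stream2,
    (streamVal1, streamVal2) -> (streamVal1 == null) ? streamVal2 : streamVal1,
    JoinWindows.of(2000L));
// Keep the first value seen per key within each 2-second window, then write it out.
mergedStream.groupByKey()
    .reduce((value1, value2) -> value1, TimeWindows.of(2000L), stateStore)
    .toStream((key, value) -> value)
    .to(outputStream);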
but I still get duplicates once in a while. I'm also using commit.interval.ms=200 to get the KTable to send to the output topic often enough.
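For illustration, here is a plain-Java sketch of two ways duplicates could slip through a pipeline like the one above. It is a toy model under stated assumptions, not the Kafka Streams API: OuterJoinSketch, outerJoinThenMerge, and windowStart are hypothetical names. The first method assumes the outer join emits a result as soon as a record arrives on either side, pairing it with null if the other side has not arrived yet. The second assumes tumbling windows are aligned to multiples of the window size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; none of these names are Kafka Streams API.
class OuterJoinSketch {

    // Assumed behaviour: every arrival emits immediately, paired with the other
    // side's buffered value for that key, or with null if there is none yet.
    // sides[i] is 1 for stream1 and 2 for stream2. Join windows are ignored.
    static List<String> outerJoinThenMerge(int[] sides, String[] keys, String[] values) {
        Map<String, String> buffer1 = new HashMap<>();
        Map<String, String> buffer2 = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sides.length; i++) {
            String v1;
            String v2;
            if (sides[i] == 1) {
                buffer1.put(keys[i], values[i]);
                v1 = values[i];
                v2 = buffer2.get(keys[i]);
            } else {
                buffer2.put(keys[i], values[i]);
                v1 = buffer1.get(keys[i]);
                v2 = values[i];
            }
            // Same joiner as in the question: prefer stream1's value.
            String merged = (v1 == null) ? v2 : v1;
            out.add(keys[i] + "=" + merged);
        }
        return out;
    }

    // Start of the tumbling window (size in ms) that a timestamp falls into.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        // The same message arrives on stream1 first, then on stream2.
        System.out.println(outerJoinThenMerge(
                new int[] {1, 2}, new String[] {"k", "k"}, new String[] {"a", "a"}));
        // Two copies 200 ms apart can still land in different 2-second windows.
        System.out.println(windowStart(1900L, 2000L) + " vs " + windowStart(2100L, 2000L));
    }
}
```

Under these assumptions, one logical message produces two merged records after the join, and two copies with timestamps 1900 and 2100 fall into different 2000 ms windows, so a per-window reduce would keep one record in each window rather than one overall.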
What would be the best way to approach this merge to get exactly-once output from multiple identical input topics?