
I have two Kafka topics streaming exactly the same content from different sources, for high availability in case one of the sources fails. I'm attempting to merge the two topics into one output topic using Kafka Streams 0.10.1.0, such that I don't miss any messages during failures and there are no duplicates when all sources are up.

When using the leftJoin method of KStream, the secondary topic can go down with no problem, but when the primary topic goes down, nothing is sent to the output topic. This seems to be because, according to the Kafka Streams developer guide,

KStream-KStream leftJoin is always driven by records arriving from the primary stream

so if there are no records coming from the primary stream, it will not use the records from the secondary stream even if they exist. Once the primary stream comes back online, output resumes normally.
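
For reference, here is roughly what my leftJoin setup looks like (topic names, types, and the window size are placeholders):

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> primary = builder.stream("primary-topic");
KStream<String, String> secondary = builder.stream("secondary-topic");

primary.leftJoin(secondary,
        (primaryVal, secondaryVal) -> (primaryVal == null) ? secondaryVal : primaryVal,
        JoinWindows.of(2000L))
       .to("output-topic");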

I've also tried using outerJoin (which produces duplicate records), followed by a conversion to a KTable via groupByKey to get rid of the duplicates:

KStream<String, String> mergedStream = stream1.outerJoin(stream2,
    (streamVal1, streamVal2) -> (streamVal1 == null) ? streamVal2 : streamVal1,
    JoinWindows.of(2000L));

mergedStream.groupByKey()
            .reduce((value1, value2) -> value1, TimeWindows.of(2000L), stateStore)  // keep one value per key and window
            .toStream((windowedKey, value) -> windowedKey.key())                    // unwrap the windowed key
            .to(outputStream);

but I still get duplicates once in a while. I'm also using commit.interval.ms=200 to get the KTable to forward to the output topic often enough.
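
That commit interval is just the standard StreamsConfig property, set like this (a sketch; the other required settings are omitted):

Properties props = new Properties();
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "200");  // commit (and flush caches) every 200 ms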

What would be the best way to approach this merge to get exactly-once output from multiple identical input topics?

Bogdan
  • In general, I would recommend the Processor API to solve the problem. You might also try switching to the current `trunk` version (not sure if this is possible for you). Joins got reworked there, and this might solve your problem: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics The new join semantics will be included in Kafka `0.10.2`, which has a target release date of Jan 2017 (https://cwiki.apache.org/confluence/display/KAFKA/Time+Based+Release+Plan). – Matthias J. Sax Nov 25 '16 at 07:13
  • @MatthiasJ.Sax I switched to the trunk and it seems like the `leftJoin` now behaves like an `outerJoin` for KStream-KStream joins, so I think I'll go back to the 0.10.1 semantics. What I'm attempting now is to create a fake stream that outputs nulls, which I'll use as the primary in a leftJoin with what used to be the primary, and use that merge in a leftJoin with the secondary. I hope this will result in always having values in the primary stream, even when my primary is down (as I'll just get null from the first leftJoin). – Bogdan Nov 25 '16 at 16:42
  • The new `leftJoin` does trigger from both sides, as the old `outerJoin` did too (I guess that is what you mean by "seems like the leftJoin now behaves like an outerJoin"?) -- this is closer to SQL semantics than the old `leftJoin` -- but `leftJoin` is still different from `outerJoin`: if the right-hand side triggers and does not find a join partner, it drops the record and no result is emitted. – Matthias J. Sax Nov 28 '16 at 05:37
  • I am also wondering how your keys are distributed and how frequently the same key is used within a single topic. Maybe a KTable that consumes both topics at once could help to deduplicate... But as mentioned, I would highly recommend using the Processor API! – Matthias J. Sax Nov 28 '16 at 05:40
  • Ah, ok, I hadn't thought of that difference between the new `leftJoin` and `outerJoin`. I did end up using the Processor API and your answer from another question (http://stackoverflow.com/a/40837977/6167108), and it works perfectly. You can add that as an answer here and I'll accept it. Thanks! – Bogdan Nov 28 '16 at 15:31

1 Answer


Using any kind of join will not solve your problem, as you will always end up with either missing results (inner join, in case one stream stalls) or "duplicates" with null (left join or outer join, in case both streams are online). See https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics for details on the join semantics in Kafka Streams.

Thus, I would recommend using the Processor API, which you can mix and match with the DSL via KStream#process(), transform(), or transformValues(). See How to filter keys and value with a Processor using Kafka Stream DSL for more details.

You can also add a custom store to your processor (How to add a custom StateStore to the Kafka Streams DSL processor?) to make duplicate-filtering fault-tolerant.
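
For illustration, here is a minimal sketch of such a deduplication transformer, assuming the 0.10.1 API, String keys and values, and illustrative topic and store names (later versions differ in the details):

KStreamBuilder builder = new KStreamBuilder();

// register a persistent store the transformer can use to track seen keys
builder.addStateStore(Stores.create("dedup-store")
                            .withStringKeys()
                            .withStringValues()
                            .persistent()
                            .build());

KStream<String, String> source1 = builder.stream(Serdes.String(), Serdes.String(), "topic-1");
KStream<String, String> source2 = builder.stream(Serdes.String(), Serdes.String(), "topic-2");

// merge() unions both streams without any join semantics
builder.merge(source1, source2)
       .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
           private KeyValueStore<String, String> seen;

           @Override
           @SuppressWarnings("unchecked")
           public void init(ProcessorContext context) {
               seen = (KeyValueStore<String, String>) context.getStateStore("dedup-store");
           }

           @Override
           public KeyValue<String, String> transform(String key, String value) {
               if (seen.get(key) != null) {
                   return null;                   // duplicate -> drop
               }
               seen.put(key, value);              // first occurrence -> remember it
               return KeyValue.pair(key, value);  // and forward it
           }

           @Override
           public KeyValue<String, String> punctuate(long timestamp) {
               return null;                       // no scheduled output
           }

           @Override
           public void close() {}
       }, "dedup-store")
       .to("output-topic");

Note that a plain key-value store like this grows without bound, so in practice you would expire old entries (for example from punctuate(), or by using a windowed store), and deduplicating by key assumes every message carries a unique key.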

Matthias J. Sax