
I come seeking knowledge of the arcane.

First, I have two pairs of topics, with the first topic in each pair feeding into the second. The second topic of each pair backs a KTable, and those two KTables are used in a KTable+KTable leftJoin. The problem is that the leftJoin produces THREE records when I produce a single record to each input topic. I would expect two records of the form (A-null, A-B), but instead I get (A-null, A-B, A-null). I have confirmed that each KTable receives exactly one record.

I have fiddled with CACHE_MAX_BYTES_BUFFERING_CONFIG to enable/disable state-store caching. The behavior above is with CACHE_MAX_BYTES_BUFFERING_CONFIG set to 0. With the default value for CACHE_MAX_BYTES_BUFFERING_CONFIG, I see the following records output from the join: (A-B, A-B, A-null).
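
To make the expectation concrete, here is a minimal plain-Java sketch (my own names, not Streams code) of the leftJoin emissions I would expect, where each table update triggers at most one join result for its key:

    import java.util.HashMap;
    import java.util.Map;

    public class ExpectedLeftJoinEmissions {
        static final Map<Long, String> left = new HashMap<>();
        static final Map<Long, String> right = new HashMap<>();

        // A left-table update always emits a result, matched or not.
        static void updateLeft(long key, String value) {
            left.put(key, value);
            System.out.println(value + "-" + right.get(key));
        }

        // A right-table update emits only when the left side has a match.
        static void updateRight(long key, String value) {
            right.put(key, value);
            if (left.containsKey(key)) {
                System.out.println(left.get(key) + "-" + value);
            }
        }

        public static void main(String[] args) {
            updateLeft(1L, "A");   // prints "A-null"
            updateRight(1L, "B");  // prints "A-B" -- two records total, not three
        }
    }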

Here are the streams, consumer, and producer configurations:

    properties.put(StreamsConfig.APPLICATION_ID_CONFIG, appName);
    properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapUrls);
    properties.put(StreamsConfig.STATE_DIR_CONFIG, String.format("/tmp/kafka-streams/%s/%s", ...)); // args sanitized
    properties.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0); // fiddled with
    properties.put(StreamsConfig.CLIENT_ID_CONFIG, appName);
    properties.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
    properties.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 1);
    properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    properties.put(ConsumerConfig.GROUP_ID_CONFIG, appName);
    properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
    properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
    properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
    properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);

The Streams DSL code (sanitized) that exhibits this behavior is below; note the topic pairs [A1, A2] and [B1, B2]:

    KTable<Long, Value> kTableA =
        kstreamBuilder.table(longSerde, valueSerde, topicA2);

    kstreamBuilder.stream(keySerde, envelopeSerde, topicA1)
        .to(longSerde, valueSerde, topicA2);

    kstreamBuilder.stream(keySerde, envelopeSerde, topicB1)
        .to(longSerde, valueSerde, topicB2);

    KTable<Long, Value> kTableB =
        kstreamBuilder.table(longSerde, valueSerde, topicB2);

    KTable<Long, Result> joinTable = kTableA.leftJoin(kTableB, (a, b) -> {
        // value joiner is called three times with only a single record
        // produced to each of topicA1 and topicB1
        return new Result(a, b); // b is null on the unmatched path; constructor is illustrative
    });

    joinTable.groupBy(...)
        .aggregate(...)
        .to(longSerde, aggregateSerde, outputTopic);
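
As Gary Russell notes in the comments, `KStreamBuilder` is deprecated in favor of `StreamsBuilder` in recent clients. For reference, a rough sketch of the same table wiring on 1.0+ APIs (assuming the same serdes and topics; this is not the code under test):

    // Sketch only, assuming Kafka Streams 1.0+: StreamsBuilder replaces the
    // deprecated KStreamBuilder, and Consumed.with(...) replaces the serde overloads.
    StreamsBuilder builder = new StreamsBuilder();

    KTable<Long, Value> tableA = builder.table(topicA2, Consumed.with(longSerde, valueSerde));
    KTable<Long, Value> tableB = builder.table(topicB2, Consumed.with(longSerde, valueSerde));

    // Same joiner as above; Result construction is illustrative.
    KTable<Long, Result> joined = tableA.leftJoin(tableB, (a, b) -> new Result(a, b));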

Thanks in advance for any and all help, oh benevolent ones.

Update: I was running with one Kafka server and one partition per topic when I saw this behavior. When I increased the number of servers to 2 and the number of partitions to 3, my output became just (A-null).

It seems to me I need to spend some more time with the Kafka manual...

Freestyle076
  • Maybe this is related for you: https://stackoverflow.com/questions/51273180/kafka-streams-join-produce-duplicates – mike Jul 18 '18 at 19:16
  • Hi @mike, thanks for the link. Unfortunately, I'm already working without caching, as the linked answer suggests. My post shows the outcome for my setup w/ and w/o caching; both appear to produce duplicate records. – Freestyle076 Jul 18 '18 at 19:28
  • This does not appear to be related to the embedded broker - I get duplicates with a physical 3-node cluster (2x A+B), but no A+null (1 partition). The admins here won't let me post my boot app as an answer (because it's not), so Gist [here](https://gist.github.com/garyrussell/1f6969cc52dd95379153dcc26f2edd84). BTW, `KStreamBuilder` is deprecated in favor of `StreamsBuilder` in recent clients. – Gary Russell Jul 18 '18 at 21:48
  • @GaryRussell thanks for throwing this into a project for me. I didn't view this as relevant, but [more research suggests](https://issues.apache.org/jira/browse/KAFKA-4609) that the subsequent operations could have an effect on the join. I've updated the second code segment to include the groupBy/aggregate calls. Would you mind throwing this in your app and running against your physical cluster? – Freestyle076 Jul 18 '18 at 22:09
  • Sorry - not sure where to put that - I am not a streams guy; I just hacked your stuff to eliminate the embedded broker (since it's in your question title). Happy to try it for you if you can be more explicit with what you want me to do. But it will be tomorrow; getting late. – Gary Russell Jul 18 '18 at 22:18
  • What version do you use? Join semantics changed in the 0.10.2 release. Compare old and new here: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics – Matthias J. Sax Jul 20 '18 at 16:43
  • Btw: the embedded broker is the exact same code as used in a real deployment -- so it's expected that it behaves the same. Nevertheless, the broker has nothing to do with the join result -- it's Kafka Streams code that computes the join. – Matthias J. Sax Jul 20 '18 at 16:45
  • This might also help: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/ – Matthias J. Sax Jul 20 '18 at 16:47
  • @MatthiasJ.Sax thanks for hopping in on this! We're on version 0.11.0.1 of the kafka-streams library. 0.11.0.0 for the kafka_2.11 library [used by the Spring Embedded Kafka library](https://mvnrepository.com/artifact/org.springframework.kafka/spring-kafka-test/2.0.3.RELEASE). If you think the embedded kafka broker is innocent, do you see any issues with the Streams API code I've posted? – Freestyle076 Jul 20 '18 at 17:48
  • Maybe you are hitting: https://issues.apache.org/jira/browse/KAFKA-4609 – Matthias J. Sax Jul 20 '18 at 20:47
  • Thanks again @MatthiasJ.Sax. The link concerns the case with caching enabled, whereas my primary concern is my first paragraph, where caching is disabled. Also, the link deals with a join, whereas I'm dealing with a leftJoin. In a leftJoin I would expect both updates to trigger the join and yield two results: A-null and A-B. However, I get A-null, A-B, A-null. Do you have a clue why I would see the final A-null? – Freestyle076 Jul 20 '18 at 21:00
  • Not sure atm. Maybe it helps to set a breakpoint in https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableKTableLeftJoin.java#L76 to see what the join computation does. – Matthias J. Sax Jul 20 '18 at 21:07
  • Great advice @MatthiasJ.Sax. I've discovered that the KTableKTableLeftJoin is operating normally. But the corresponding KTableKTableRightJoin is producing the second A-null due to [enabled "sendOldValues" functionality](https://github.com/apache/kafka/blob/9449f055c7a0b340a8d69d7365c5817464b2f6ed/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableKTableRightJoin.java#L97). A-null is seen because the oldValue for the rhs is null. I'm not sure what the intent is for `sendOldValues` but for this streams topology it seems counterintuitive. Do you have any insight? – Freestyle076 Jul 20 '18 at 22:55
  • Follow up^^^ the `sendOldValues` field is set by the call to `groupBy(...)` – Freestyle076 Jul 20 '18 at 22:56
  • Sending the old value is required to compute the aggregation correctly. Note that the old value will be "subtracted" from the aggregation while the new value will be added. – Matthias J. Sax Jul 22 '18 at 01:48
  • @MatthiasJ.Sax Ok, that much makes sense. The results of the aggregate are undesirable, however (I thought this was caused by the `sendOldValues` behavior). What I'm seeing input into the aggregator is three changes, which I'll list as [oldValue, newValue]: `[null, A], [A, null], [null, A]`. The output of the aggregate (notated as [key, value]) takes the form `[1, [A] ], [1, [] ], [1, [A] ]`. The second output is unexpected and undesired, and it results from the second input with a `null` newValue. I've spotted where the null newValue is being passed; link in the following comment. – Freestyle076 Jul 23 '18 at 15:38
  • [Null newValue passed](https://github.com/apache/kafka/blob/924d27c1e163bca581dd5da2a91054df8278a2fa/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L90) – Freestyle076 Jul 23 '18 at 15:39
  • @MatthiasJ.Sax my hypothesis is that it's due to disabled caching. Since an update to a previously existing record in an aggregate happens by subtraction then addition, there are two records that would undergo compaction before being output if caching were enabled. Instead, without caching, the result of the subtraction makes it through as output (see the sketch after these comments). Problem is, we cannot enable caching due to its other side-effects. If I'm correct, any suggestions on how to circumvent this issue without enabling caching? – Freestyle076 Jul 23 '18 at 16:13
  • You are hitting a corner case -- in general, `[null, A], [A, null]` would go to different instances -- in your case, it's going to the same instance because the key does not change, and thus you get this "undesired" but correct result. Assume the following base table [key, value]: `[A, x], [B, x], [C, y]`, and you rekey the table to its value and count. The result would be `[x, 2], [y, 1]`. Now let's update `[B, x]` to `[B, y]`. We need to subtract one `x` and add one `y` to the result to get `[x, 1], [y, 2]` -- this will emit two output records, one for each update. – Matthias J. Sax Jul 23 '18 at 17:38
  • In your case, you do something similar, updating `[B, x]` to `[B, x]`. Thus Kafka Streams subtracts one `x` and adds one `x`, resulting in two output records. I am not aware of a workaround; however, there was a similar discussion on the mailing list recently, and it would be possible to detect this corner case and only emit a single update value instead of two. Not sure if there is a JIRA for this already -- if not, feel free to create one so we can add this optimization (feel free to pick up the JIRA yourself). – Matthias J. Sax Jul 23 '18 at 17:43
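
To make the subtract-then-add behavior from the comments concrete, here is a minimal plain-Java sketch (all names are mine, not Streams internals) of how each [oldValue, newValue] change drives the aggregate, reproducing the observed [1, [A]], [1, []], [1, [A]] sequence:

    import java.util.ArrayList;
    import java.util.List;

    public class SubtractThenAddSketch {
        // Aggregate for key 1: the list of joined values.
        static final List<String> agg = new ArrayList<>();

        // Each change first subtracts the old value, then adds the new one,
        // emitting the current aggregate after every change.
        static void apply(String oldValue, String newValue) {
            if (oldValue != null) agg.remove(oldValue); // subtractor
            if (newValue != null) agg.add(newValue);    // adder
            System.out.println("[1, " + agg + "]");
        }

        public static void main(String[] args) {
            apply(null, "A"); // [1, [A]]
            apply("A", null); // [1, []]  <- the unexpected intermediate emission
            apply(null, "A"); // [1, [A]]
        }
    }

With caching enabled, the subtract and add emissions for the same key would typically be compacted before being forwarded downstream, which matches the hypothesis above that the intermediate [1, []] only surfaces when caching is disabled.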

0 Answers