Why do some joins not work without selectKey first?

Question

In my joins, I am finding that the second block below gives the expected result, whereas the first block does not and never invokes (aValue, bValue) -> myFunc(aValue, bValue). I didn't think the record's actual key mattered as long as I selected the right field to join on with (aKey, aValue) -> aValue.get("someField").asText(), but calling .selectKey((aKey, aValue) -> aValue.get("someField").asText()) beforehand makes the join go through correctly. I have also seen cases that did not require selectKey. Can someone explain the difference?

// does not join correctly and gives unexpected result
KStream<String, JsonNode> c = a
  .leftJoin(b,
    (aKey, aValue) -> aValue.get("someField").asText(),
    (aValue, bValue) -> myFunc(aValue, bValue)
);

// does join correctly and gives expected result
KStream<String, JsonNode> c = a
  .selectKey((aKey, aValue) -> aValue.get("someField").asText())
  .leftJoin(b,
    (aKey, aValue) -> aKey,
    (aValue, bValue) -> myFunc(aValue, bValue)
);

score 0 · Answer 1 · answered Apr 18 '20 at 22:28

Kafka Streams has many different joins with different semantics, and I am not 100% sure which join you are executing.

Given your example, it seems you are using a KStream-GlobalKTable join: b seems to be a GlobalKTable, and the second argument (i.e., (aKey, aValue) -> aValue.get("someField").asText()) is your keySelector?

If this is correct, the first code snippet looks correct to me. What version are you using (maybe there is a bug in Kafka Streams)? Can you also share the output of Topology#describe()#toString() for both cases?
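To make the keySelector idea concrete, here is a minimal plain-Java model of the lookup, assuming b really is a GlobalKTable. It does not use the Kafka Streams API: the class JoinLookupSketch, its methods, and the sample data are hypothetical, and values are simplified to maps and strings instead of JsonNode. It only illustrates that the selector picks the table key to look up, while the joined record keeps whatever key the stream record had at that point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical plain-Java model of a stream-to-table left join; not Kafka Streams code.
public class JoinLookupSketch {

    // For each stream record, the selector computes the table key to look up.
    // The output record keeps the stream record's own key.
    public static List<Map.Entry<String, String>> leftJoin(
            List<Map.Entry<String, Map<String, String>>> stream,
            Map<String, String> table,
            BiFunction<String, Map<String, String>, String> keySelector) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> rec : stream) {
            String lookupKey = keySelector.apply(rec.getKey(), rec.getValue());
            String tableValue = table.get(lookupKey); // null when there is no match
            String joined = rec.getValue().get("someField") + "|" + tableValue;
            out.add(Map.entry(rec.getKey(), joined));
        }
        return out;
    }

    // Models selectKey: replaces each record's key, leaving the value untouched.
    public static List<Map.Entry<String, Map<String, String>>> selectKey(
            List<Map.Entry<String, Map<String, String>>> stream,
            BiFunction<String, Map<String, String>, String> keySelector) {
        List<Map.Entry<String, Map<String, String>>> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> rec : stream) {
            out.add(Map.entry(keySelector.apply(rec.getKey(), rec.getValue()), rec.getValue()));
        }
        return out;
    }

    // Made-up sample data: one stream record keyed "k1", one table row keyed "x".
    static final List<Map.Entry<String, Map<String, String>>> A =
            List.of(Map.entry("k1", Map.of("someField", "x")));
    static final Map<String, String> B = Map.of("x", "B");

    // Shape of the first snippet: select the lookup key from the value.
    public static List<Map.Entry<String, String>> directJoin() {
        return leftJoin(A, B, (k, v) -> v.get("someField"));
    }

    // Shape of the second snippet: re-key first, then look up by the new key.
    public static List<Map.Entry<String, String>> rekeyedJoin() {
        return leftJoin(selectKey(A, (k, v) -> v.get("someField")), B, (k, v) -> k);
    }

    public static void main(String[] args) {
        System.out.println(directJoin());
        System.out.println(rekeyedJoin());
    }
}
```

In this model both variants find the same table row, so the joined values match; the only difference is the key of the result (k1 for the direct join, x after re-keying). If the real topology behaves differently, the Topology#describe() output for both cases is the place to look.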
