
I'm trying to join data from two topics, person and address, where one person can have multiple addresses. The data published to the topics looks like this:

// person with id as key
{"id": "123", "name": "Tom Tester"}

// addresses with id as key
{"id": "321", "person_id": "123", "address": "Somestreet 12, 4321 Somewhere"}
{"id": "432", "person_id": "123", "address": "Otherstreet 12, 5432 Nowhere"}

After the join I would like to have an aggregated output (to be indexed in Elasticsearch) that should look something like this:

{
  "id": "123",
  "name": "Tom Tester",
  "addresses": [
    {
      "id": "321",
      "address": "Somestreet 12, 4321 Somewhere"
    },
    {
      "id": "432",
      "address": "Otherstreet 12, 5432 Nowhere"
    }
  ]
}

Whenever the person or address topic gets an update, the aggregated person should also be updated. Currently I only get updates on the aggregated person when addresses are published, but not when the person itself is changed. Any ideas what is wrong with this code?

@SpringBootApplication
@EnableBinding(PersonAggregatorBinding.class)
public class KafkaStreamTestApplication {

    public static void main(String[] args) {
        SpringApplication.run(KafkaStreamTestApplication.class, args);
    }

    private static final Logger LOG = LoggerFactory.getLogger(KafkaStreamTestApplication.class);

    @StreamListener
    @SendTo("person-aggregation")
    public KStream<String, PersonAggregation> process(
            @Input("person-input") KTable<String, Person> personInput,
            @Input("address-input") KTable<String, Address> addressInput) {
        // re-key the addresses by person_id and collect them into one
        // AddressAggregation per person
        KTable<String, AddressAggregation> addressAggregate = addressInput.toStream()
                .peek((key, value) -> LOG.info("addr {}: {}", key, value))
                .groupBy((k, v) -> v.getPersonId(), Grouped.with(null, new AddressSerde()))
                .aggregate(
                        AddressAggregation::new,
                        (key, value, aggregation) -> {
                            aggregate(aggregation, value);
                            return aggregation;
                        }, Materialized.with(Serdes.String(), new AddressAggregationSerde()));

        // debug output of the per-person address aggregate
        addressAggregate.toStream()
                .peek((key, value) -> LOG.info("aggregated addr: {}", value));

        // stream-table join: persons are joined against the current state
        // of the address aggregate
        return personInput.toStream()
                .leftJoin(addressAggregate, this::join,
                        Joined.with(Serdes.String(), new PersonSerde(), new AddressAggregationSerde()))
                .peek((key, value) -> LOG.info("aggregated person: {}", value));
    }

    private PersonAggregation join(Person person, AddressAggregation addrs) {
        return PersonAggregation.builder()
                .id(person.getId())
                .name(person.getName())
                .addresses(addrs)
                .build();
    }

    // replaces any previous version of the address in the aggregation and
    // drops it entirely if it is no longer valid
    public void aggregate(AddressAggregation aggregation, Address address) {
        if (address != null) {
            aggregation.removeIf(it -> Objects.equals(it.getId(), address.getId()));
            if (address.isValid()) {
                aggregation.add(address);
            }
        }
    }
}
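
For reference, a minimal sketch of the PersonAggregatorBinding interface referenced above (it is not shown in the question, so this is a hypothetical reconstruction; the channel names are taken from the @Input and @SendTo annotations):

import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.springframework.cloud.stream.annotation.Input;
import org.springframework.cloud.stream.annotation.Output;

// hypothetical reconstruction: the question does not show this interface
public interface PersonAggregatorBinding {

    @Input("person-input")
    KTable<String, Person> personInput();

    @Input("address-input")
    KTable<String, Address> addressInput();

    @Output("person-aggregation")
    KStream<String, PersonAggregation> personAggregation();
}
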
m-kay
  • Maybe record caching? Try disabling the `KTable` caches: https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html – Matthias J. Sax Feb 28 '21 at 19:14
  • Unfortunately this didn't help. Since it has been a while since I asked this question, I had to set up the example again, and now I see different behavior: I only get updates in the aggregation when a person is updated. – m-kay Mar 02 '21 at 05:56
  • 1
    Well, you do a stream-table join, so what you observe is expected. Checkout: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/ -- I guess you want to use a table-table join instead -> `personInput.leftJoin(addressAggregate)` -- why do you convert the `personInput` to a KStream before the join? – Matthias J. Sax Mar 02 '21 at 17:10
  • Also: https://www.confluent.io/kafka-summit-ny19/zen-and-the-art-of-streaming-joins/ and https://www.confluent.io/resources/kafka-summit-2020/the-flux-capacitor-of-kafka-streams-and-ksqldb/ – Matthias J. Sax Mar 02 '21 at 17:12
  • Yes, you are right, the table-table join works as expected. However, I could not use Spring's KTable auto-binding directly because the Serdes are configured incorrectly. I only managed to make it work by using the KStream binding and creating the KTable in my code with `personInput.toTable(Materialized.with(Serdes.String(), new PersonSerde()))` (see the sketch after these comments). – m-kay Mar 04 '21 at 05:13
  • My follow-up question is how I could use the addresses as a GlobalKTable. Since the persons and addresses are not co-partitioned and I need all the addresses joined with the person, I would need a GlobalKTable, but a GlobalKTable can only be joined with a KStream, not with a KTable. – m-kay Mar 04 '21 at 05:16
  • Not sure about Spring... (don't know how it really works). -- GlobalKTable-KTable joins are a missing feature in Kafka Streams (cf. https://issues.apache.org/jira/browse/KAFKA-4628). However, since Kafka 2.4.0, foreign-key KTable-KTable joins are supported, which should address this issue (sketch below): https://issues.apache.org/jira/browse/KAFKA-3705 – Matthias J. Sax Mar 08 '21 at 23:09
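
Putting the comments together, a minimal sketch of the approach that worked: bind the inputs as KStreams, build the person KTable in code with explicit serdes as in the workaround above, then use a table-table left join so that an update on either side emits a new aggregated record. This assumes personInput is now bound as a KStream<String, Person>; addressAggregate is the KTable built in the question:

        // build the person KTable manually with explicit serdes
        KTable<String, Person> personTable = personInput
                .toTable(Materialized.with(Serdes.String(), new PersonSerde()));

        // table-table join: an update to either the person or the address
        // aggregate now produces a new aggregated person record
        return personTable
                .leftJoin(addressAggregate, this::join)
                .toStream()
                .peek((key, value) -> LOG.info("aggregated person: {}", value));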
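
For the co-partitioning concern raised in the last comments, a sketch of the foreign-key KTable-KTable join available since Kafka 2.4.0 (KAFKA-3705). This assumes addressTable is the address input as a KTable<String, Address> keyed by address id, and PersonAddress is a hypothetical pair type; Kafka Streams re-partitions the address side by the extracted person id internally, so the two topics do not need to be co-partitioned:

        // join each address with its person via the person_id foreign key;
        // the result stays keyed by the address id
        KTable<String, PersonAddress> joined = addressTable.join(
                personTable,
                Address::getPersonId,  // foreign-key extractor
                (address, person) -> new PersonAddress(person, address));

The per-person address list can then be rebuilt by grouping this table by person id and aggregating, analogous to the AddressAggregation in the question.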

0 Answers