
I am trying to deduplicate records by consuming the input topic as a KTable and sinking it to the output topic. But the KTable is still sinking the duplicate records to the output topic. I am not sure where I am going wrong.

Here is my application.yml

spring:
  cloud:
    stream:
      function:
        bindings:
          process-in-0: input.topic
          process-out-0: output.topic
        definition: process
      kafka:
        streams:
          bindings:
            process-in-0:
              consumer:
                materializedAs: incoming-store
          binder:
            application-id: spring-cloud-uppercase-app
            brokers: localhost:9092
            configuration:
              commit:
                interval:
                  ms: 1000
                state.dir: state-store
              default:
                key:
                  serde: org.apache.kafka.common.serialization.Serdes$StringSerde
                value:
                  serde: org.apache.kafka.common.serialization.Serdes$StringSerde

As per the Spring Cloud Stream Kafka Streams documentation on state stores, I have added the materialized state store above as incoming-store.

The process() bean function takes the input topic as a KTable and sinks it to the output topic:


    @Bean
    public Function<KTable<String, String>, KStream<String, String>> process(){
        return table -> table
                .toStream()
                .peek((k, v) -> log.info("Received key={}, value={}", k, v));
    }

For a given input of 4 records

key=111, value="a"
key=111, value="a"
key=222, value="b"
key=111, value="a"

I am expecting to get only 2 records

key=111, value="a"
key=222, value="b"

But I am getting all 4 records. Any help would be really appreciated!

tintin
  • I would solve this by aggregating events based on the `key` and with a time window – Felipe May 08 '21 at 20:50
  • Based on your comments I am not sure what you are trying to do. Are you trying to compact events without transforming them, using `KStream` https://docs.spring.io/spring-cloud-stream-binder-kafka/docs/3.1.1/reference/html/spring-cloud-stream-binder-kafka.html#kafka-tombstones ? – Felipe May 08 '21 at 21:20
  • I am trying to keep only the latest update for a given event. So, if I get the same record with key=111, I need to keep only the latest record. I can do this by converting the stream to a KTable, similar to [https://kafka-tutorials.confluent.io/kafka-streams-convert-to-ktable/kstreams.html](https://kafka-tutorials.confluent.io/kafka-streams-convert-to-ktable/kstreams.html) (see the sketch after these comments). – tintin May 09 '21 at 14:28
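
For reference, a minimal sketch of the stream-to-table conversion mentioned in the comment above, written against the plain Kafka Streams DSL rather than the Spring Cloud Stream binder (the class name and store name are illustrative assumptions; the topic names follow the question's bindings):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class LatestPerKeyTopology {

        // Sketch only: toTable() keeps the latest value per key in the backing
        // state store, but the downstream changelog can still emit an update for
        // every incoming record (subject to caching and the commit interval).
        public static StreamsBuilder build() {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input.topic", Consumed.with(Serdes.String(), Serdes.String()))
                    .toTable(Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("latest-per-key-store"))
                    .toStream()
                    .to("output.topic", Produced.with(Serdes.String(), Serdes.String()));
            return builder;
        }
    }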

2 Answers


You can group by key and aggregate the events. Although you are not concatenating strings during the aggregation, the aggregate transformation is used only to emit the latest value for each key (111 or 222). Your use case is essentially a distinct aggregation: on every aggregation step you receive (key, value, aggregate), and you keep only the value, which is the latest value for that key.

import java.util.function.Function;

import lombok.extern.slf4j.Slf4j;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Slf4j
@Configuration
@EnableAutoConfiguration
public class KafkaAggFunctionalService {

    @Bean
    public Function<KTable<String, String>, KStream<String, String>> aggregate() {
        return table -> table
                .toStream()
                .groupBy((key, value) -> key, Grouped.with(Serdes.String(), Serdes.String()))
                // The aggregator simply keeps the incoming value, so the store
                // always holds the latest value seen for each key.
                .aggregate(() -> "", (key, value, aggregate) -> value,
                        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("test-events-snapshots")
                                .withKeySerde(Serdes.String())
                                .withValueSerde(Serdes.String()))
                .toStream()
                .peek((k, v) -> log.info("Received key={}, value={}", k, v));
    }
}

This git repo has a lot of examples. The one that looks very similar to yours is this.

Felipe

I think the problem you are trying to solve would be well served by a compacted topic. Once you deliver data with the same key to a compacted topic and compaction is enabled at the broker level (which it is by default), each broker starts a compaction manager thread and a number of compaction threads, which are responsible for performing the compaction tasks. Compaction keeps only the latest value for each key and cleans up the older (dirty) entries.
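
If you go this route, the compacted output topic can also be declared from the application itself, for example with Spring Kafka's TopicBuilder (a sketch only; the partition and replica counts are illustrative assumptions):

    import org.apache.kafka.clients.admin.NewTopic;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.config.TopicBuilder;

    @Configuration
    public class CompactedTopicConfig {

        // Sketch: declares output.topic with cleanup.policy=compact so the broker
        // keeps only the latest record per key once compaction runs.
        @Bean
        public NewTopic outputTopic() {
            return TopicBuilder.name("output.topic")
                    .partitions(1)
                    .replicas(1)
                    .compact()
                    .build();
        }
    }

If a KafkaAdmin is available (Spring Boot auto-configures one), the topic is created at application startup if it does not already exist.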

Refer to the Kafka documentation for more details.

  • I am using Spring Cloud Stream, which abstracts `KTable table = streamBuilder.toTable("input.topic");` into a Function... so, yes, I am using log compaction. However, the log compaction does not seem to be working. That's the main issue I am referring to in my question. – tintin May 07 '21 at 20:01
  • Have you verified your output topic's configuration? Is it a compacted topic? – Arpit Saxena May 07 '21 at 20:16
  • I am using the KTable abstraction on the input topic in my stream application and converting it to a KStream to sink the result to the output topic, so I don't want to configure the output topic to be log compacted with a configuration like --config "cleanup.policy=compact". The question is about the cloud Kafka binder not accepting the KTable input. – tintin May 07 '21 at 22:26