
I'm trying to use Kafka Streams to reduce a series of numbers, and I only want a record emitted when the data has changed. It works perfectly, but the problem is that it does not catch up on data from Kafka if the service running the code has been down. So I guess the solution is wrong? My code:

KGroupedStream<String, JsonNode> groupedStream = filteredStream.groupByKey(Serdes.String(), jsonSerde);
KTable<String, JsonNode> reducedTable = groupedStream.reduce(
        (aggValue, newValue) -> Calculate.newValue(newValue, aggValue, logger), /* adder */
        "reduced-stream-store" /* state store name */);
KStream<String, JsonNode> reducedStream = reducedTable.toStream();

The "Calculate" method does roughly this:

if (value != oldValue)
    return value;
else
    return null;

Thanks for any comments or suggestions.

Margit

1 Answer

Returning null from your reducer deletes the entry from the result table. Hence, your code does not do what you expect.

In fact, the DSL operators emit "on update", not "on change", and thus you cannot use the DSL for your use case. There is a ticket that suggests adding "emit on change" semantics (https://issues.apache.org/jira/browse/KAFKA-8770).

As a workaround, you will need to use a custom transform() with a state store instead. For each input record, you check whether its key exists in the store. If not, emit the record and put it into the store. If it does exist and the value is the same, don't emit anything. If the value is different, emit the record and update the store.
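
The check-and-update steps above can be sketched in plain Java. This is only an illustration of the logic, not a Kafka Streams topology: the class name EmitOnChangeFilter and its process method are hypothetical, and a HashMap stands in for the state store that a real transformer would look up when it is initialized. It also assumes non-null values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Emit-on-change filter (illustrative sketch).
 * The map below is a hypothetical stand-in for a Kafka Streams state store.
 */
class EmitOnChangeFilter<K, V> {
    // Last value seen per key; a real transformer would use a persistent store.
    private final Map<K, V> store = new HashMap<>();

    /**
     * Returns the value to forward downstream, or empty if nothing changed.
     * Assumes newValue is non-null.
     */
    Optional<V> process(K key, V newValue) {
        V previous = store.get(key);
        if (previous != null && previous.equals(newValue)) {
            // Key exists and the value is the same: emit nothing.
            return Optional.empty();
        }
        // Key is new or the value differs: update the store and emit.
        store.put(key, newValue);
        return Optional.of(newValue);
    }
}
```

Note that the comparison uses equals() rather than != so that two equal values held in different objects are treated as unchanged.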

Matthias J. Sax
  • Thanks, I will try that out in a later iteration and get back – Margit Oct 21 '19 at 11:32
  • I'm concerned about the most performant way to handle deletions: if we create a KVStore as shown [here](https://github.com/davidandcode/hackathon_kafka/blob/master/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java), we can create punctuations, but that may make the stream wait for an indeterminate time during bulk deletions. Another option is to re-read and send tombstones to the topic. And another one is to create a fake-windowed store as suggested [here](https://stackoverflow.com/questions/51070790/kafka-streams-rocksdb-ttl). What are the trade-offs? – xmar Jul 02 '20 at 16:30
  • Not sure how your comment relates to the question of "emit on update" vs "emit on change" semantics? Maybe start a new question? – Matthias J. Sax Jul 04 '20 at 18:53