I am seeing a very strange issue with one of my Kafka topics. I have a Spring Cloud Stream Kafka Streams app that reads from INPUT_TOPIC (around 20 million records), groups the records by key, aggregates the values of a few fields into a single string, and pushes the result to an output topic. I do this so I can use the entire topic as lookup/join data: AGGREGATED_OUTPUT_TOPIC is meant to be read as a GlobalKTable and joined with another topic on the keys. If I load INPUT_TOPIC without aggregating, records that share a key collapse down to only the latest value. My code is below, followed by a rough sketch of the downstream join I am aiming for.
kStream.groupByKey()
       .windowedBy(TimeWindows.of(Duration.ofMinutes(30)))
       .aggregate(() -> initStr,
               (key, value, agg) ->
                       agg + "::" + value.getSequence() + "|" + value.getComment() + "|" + value.getDtEntered(),
               Materialized.with(Serdes.String(), Serdes.String()))
       .toStream()
       // the windowed key prints as [key@start/end], so strip it back down to the plain key
       .selectKey((k, v) -> k.toString().substring(1).split("@")[0])
       .peek((key, value) -> log.info("key: {}, value: {}", key, value))
       .to("AGGREGATED_OUTPUT_TOPIC");
My config looks like this:
spring:
  cloud:
    stream:
      schemaRegistryClient:
        endpoint: https://kafka.shared.internal.xxxx.com.au:8081
      bindings:
        kstream_input_channel:
          destination: INPUT_TOPIC
      kafka:
        streams:
          binder:
            applicationId: appID
            brokers: b-2.shared.twluaa.c3.kafka.ap-southeast-2.amazonaws.com:9096,b-1.shared.twluaa.c3.kafka.ap-southeast-2.amazonaws.com:9096,b-3.shared.twluaa.c3.kafka.ap-southeast-2.amazonaws.com:9096
            configuration:
              auto.offset.reset: earliest
              schema.registry.url: https://kafka.shared.internal.com.au:8081
              security:
                protocol: SASL_SSL
              sasl:
                mechanism: SCRAM-SHA-512
                jaas:
                  config: org.apache.kafka.common.security.scram.ScramLoginModule required username="clouduser" password="dummy";
              commit.interval.ms: 20000
              state.dir: state-store
              default:
                key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
                value.serde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
          bindings:
            kstream_input_channel:
              consumer:
                keySerde: org.apache.kafka.common.serialization.Serdes$StringSerde
                valueSerde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
                startOffset: earliest
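For completeness, the binding itself looks roughly like this (a sketch of the legacy @EnableBinding/@StreamListener style that matches the kstream_input_channel binding name; InputRecord stands in for my actual Avro class):

import org.apache.kafka.streams.kstream.KStream;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.Input;
import org.springframework.cloud.stream.annotation.StreamListener;

public interface KStreamBindings {
    @Input("kstream_input_channel")
    KStream<String, InputRecord> inputStream();
}

@EnableBinding(KStreamBindings.class)
public class AggregationProcessor {

    @StreamListener("kstream_input_channel")
    public void process(KStream<String, InputRecord> kStream) {
        // the groupByKey / windowedBy / aggregate topology shown above goes here
    }
}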
**The aggregating app starts up, consumes, and aggregates fine, but after a while it stops reading and only prints this:
stream-thread [appID-9a5fe835-835e-45c6-9d96-5799d1cb58b7-StreamThread-1] Processed 935164 total records, ran 0 punctuators, and committed 12 total tasks since the last update **
After this, the record count on the output topic gradually drops until it reaches 0 (I watch this in AKHQ by refreshing the topic). So if the app had already grouped and aggregated, say, 5 million records, once this message appears the data starts disappearing from the topic, and unless I stop the app the topic size goes all the way down to 0.
I also tried removing the windowedBy part of the code (see the sketch below), but the behaviour is the same. Could the way the code is written be causing this? For example, could the state store on my dev machine (where the app runs) be getting emptied, and could that in turn empty the topic? I am just guessing here as I am out of ideas...
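The non-windowed variant I tried is roughly this (same aggregation, just without the window, so the key stays a plain String and no selectKey is needed):

kStream.groupByKey()
       .aggregate(() -> initStr,
               (key, value, agg) ->
                       agg + "::" + value.getSequence() + "|" + value.getComment() + "|" + value.getDtEntered(),
               Materialized.with(Serdes.String(), Serdes.String()))
       .toStream()
       .peek((key, value) -> log.info("key: {}, value: {}", key, value))
       .to("AGGREGATED_OUTPUT_TOPIC");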
I have also checked and updated all the topic config, listed below (an AdminClient sketch for double-checking these follows the table):
cleanup.policy delete DEFAULT_CONFIG
compression.type gzip STATIC_BROKER_CONFIG
delete.retention.ms 1 days DEFAULT_CONFIG
file.delete.delay.ms 1 days DYNAMIC_TOPIC_CONFIG
flush.messages 9223372036854776000 DEFAULT_CONFIG
flush.ms 292271023 years 2 weeks DEFAULT_CONFIG
follower.replication.throttled.replicas DEFAULT_CONFIG
index.interval.bytes 4096 DEFAULT_CONFIG
leader.replication.throttled.replicas DEFAULT_CONFIG
max.compaction.lag.ms 292271023 years 2 weeks DEFAULT_CONFIG
max.message.bytes 1048588 DEFAULT_CONFIG
message.downconversion.enable true DEFAULT_CONFIG
message.format.version 2.7-IV2 STATIC_BROKER_CONFIG
message.timestamp.difference.max.ms 292271023 years 2 weeks DEFAULT_CONFIG
message.timestamp.type CreateTime DEFAULT_CONFIG
min.cleanable.dirty.ratio 0.5 DEFAULT_CONFIG
min.compaction.lag.ms 0 seconds DEFAULT_CONFIG
min.insync.replicas 2 STATIC_BROKER_CONFIG
preallocate false DEFAULT_CONFIG
retention.bytes -1 DEFAULT_CONFIG
retention.ms 1 weeks DEFAULT_CONFIG
segment.bytes 1073741824 DEFAULT_CONFIG
segment.index.bytes 10485760 DEFAULT_CONFIG
segment.jitter.ms 0 seconds DEFAULT_CONFIG
segment.ms 1 weeks DEFAULT_CONFIG
unclean.leader.election.enable false
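(The settings above can also be double-checked outside AKHQ with something like the sketch below; the bootstrap server is one of the brokers from the config, and the same security.protocol / sasl.* properties as in the streams app would need to be added.)

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class TopicConfigCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "b-1.shared.twluaa.c3.kafka.ap-southeast-2.amazonaws.com:9096");
        // plus the same security.protocol / sasl.* settings used by the streams app

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "AGGREGATED_OUTPUT_TOPIC");
            Config config = admin.describeConfigs(List.of(topic)).all().get().get(topic);
            config.entries().forEach(entry ->
                    System.out.println(entry.name() + " = " + entry.value() + " (" + entry.source() + ")"));
        }
    }
}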
Also, I have no consumer running that reads from the output topic AGGREGATED_OUTPUT_TOPIC.
PS: I had an older question about this but deleted it and am rephrasing it here.