
I am performing an aggregation with Kafka Streams that keeps all of my aggregated records in a KeyValue state store, against a key I generate to uniquely identify each aggregation. I am not using any Kafka window for this aggregation, so the topology keeps listening to input data and keeps aggregating. Now, based on the key, I need to apply different logic to search the state store and move my data downstream.

Kafka's KeyValue state store gives me four methods: all(), prefixScan(), range(), and get(). Based on the key I am generating, I find I can only use all() and get().

  1. If I use get(), my understanding is that Kafka will internally scan over the complete state store to return the data for that specific key; so if I have a list of keys, it will iterate over the complete state store once per key in the list.
  2. If I manage to build a regex for my search key, I can use all(), iterate over every record in the state store in Java logic, match the regex, and move the hits downstream. But again, that is a manual iteration over the complete state store. (A sketch of the four lookup methods follows this list.)
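For reference, here is a minimal sketch of what the four lookup methods look like through interactive queries. The store name "agg-store" and the String key/value types are placeholders, not anything from the original topology:

```java
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreLookups {

    // "agg-store" is a hypothetical store name; adjust to your topology.
    static void lookupExamples(KafkaStreams streams) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
            StoreQueryParameters.fromNameAndType(
                "agg-store", QueryableStoreTypes.keyValueStore()));

        // 1. Point lookup: fetches a single key directly, no full scan.
        String value = store.get("order-123");

        // 2. Range scan: keys between two bounds, in serialized-byte order.
        try (KeyValueIterator<String, String> range =
                 store.range("order-100", "order-200")) {
            while (range.hasNext()) {
                KeyValue<String, String> kv = range.next();
                // process kv.key / kv.value
            }
        }

        // 3. Prefix scan (Kafka Streams 2.8+): keys sharing a common prefix.
        try (KeyValueIterator<String, String> byPrefix =
                 store.prefixScan("order-", new StringSerializer())) {
            while (byPrefix.hasNext()) {
                KeyValue<String, String> kv = byPrefix.next();
            }
        }

        // 4. Full scan: every record in the store; at a billion records
        //    this is the one to avoid.
        try (KeyValueIterator<String, String> all = store.all()) {
            while (all.hasNext()) {
                KeyValue<String, String> kv = all.next();
            }
        }
    }
}
```

If the generated keys can be designed so that the search criterion becomes a common leading prefix, prefixScan() or range() can replace the full iteration that all() forces.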

P.S. At any point in time my state store will contain at least a billion records.

Can someone please suggest the best way, performance-wise, to retrieve data by key from a Kafka KeyValue state store? Any alternative to this approach is also appreciated.

Update: after eviction of data from the state store, I am not deleting it but wish to update it with a flag stating whether it has been evicted or not. That is only possible with read/write access to the state store, which in turn is only available from inside the topology, since interactive queries give read-only access to the store. This is where my knowledge of Kafka ends; please correct me if I am wrong.
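One way to get that read/write access is to attach the store to a custom processor inside the topology: interactive queries stay read-only, but a processor sees the writable KeyValueStore. A minimal sketch, again assuming the hypothetical "agg-store" with String values and crudely encoding the flag into the value string:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Marks a record in the store as evicted instead of deleting it.
// The store must be connected to this processor when building the
// topology (e.g. Topology#addProcessor + #connectProcessorAndStateStores).
public class EvictionFlagProcessor
        implements Processor<String, String, String, String> {

    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        store = context.getStateStore("agg-store");
    }

    @Override
    public void process(Record<String, String> record) {
        String current = store.get(record.key());
        if (current != null) {
            // Read/write access: rewrite the value with an eviction flag.
            // A real value type would carry a proper boolean field instead.
            store.put(record.key(), current + "|evicted=true");
        }
    }
}
```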

pranayd
  • It's a KV-store... `get` should be an O(1) operation, and not scan over the entire store. If you really want something more performant, then I feel like using a proper database server with indices would help – OneCricketeer Jun 23 '22 at 19:15

1 Answer


I think you should use Apache Spark Streaming for this:

  1. Read data from Kafka through Spark Streaming
  2. Perform the aggregations/transformations in Spark
  3. Push the sanitized data into the desired downstream topics

I am not sure if this can be done in Kafka itself. (A minimal sketch of the Spark approach follows.)
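If you go that route, here is a minimal sketch of those three steps with Spark Structured Streaming's Java API. The broker address, topic names, and checkpoint path are placeholders, and a plain count per key stands in for the real aggregation:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaSparkAggregation {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("kafka-aggregation")
            .getOrCreate();

        // 1. Read from Kafka; broker and topic names are placeholders.
        Dataset<Row> input = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "input-topic")
            .load();

        // 2. Aggregate: count records per key (stand-in for the real logic).
        Dataset<Row> counts = input
            .selectExpr("CAST(key AS STRING) AS key")
            .groupBy("key")
            .count();

        // 3. Write results back to Kafka; the sink needs string key/value
        //    columns and a checkpoint location for fault tolerance.
        StreamingQuery query = counts
            .selectExpr("key", "CAST(count AS STRING) AS value")
            .writeStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("topic", "output-topic")
            .option("checkpointLocation", "/tmp/agg-checkpoint")
            .outputMode("update")
            .start();

        query.awaitTermination();
    }
}
```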

Debug Logs