
I am performing an aggregation with Kafka Streams that keeps all of my aggregated records in a KeyValue state store, against a key I generate to uniquely identify each aggregation. I am not using any Kafka window for this aggregation, so the topology keeps listening to input data and keeps aggregating. Now, based on the key, I need to apply different logic to search the state store and move my data downstream.

Kafka's KeyValue state store gives me four methods: all(), prefixScan(), range(), and get(). Based on the key I am generating, I find I can only use all() and get().

  1. If I use get(), my understanding is that Kafka will internally scan over the complete state store to return the data for that specific key; so if I have a list of keys, it will iterate over the complete state store once per key in the list.
  2. If I manage to build a regex for my search key, I can use all(), iterate over every record in the state store in Java logic, match the regex, and move the hits downstream. But again, that is a manual iteration over the complete state store. (A sketch of the four lookup methods follows this list.)
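For reference, here is a minimal sketch of what the four lookup methods look like through interactive queries. The store name "agg-store" and the String key/value types are placeholders, not anything from the original topology:

```java
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreLookups {

    // "agg-store" is a hypothetical store name; adjust to your topology.
    static void lookupExamples(KafkaStreams streams) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
            StoreQueryParameters.fromNameAndType(
                "agg-store", QueryableStoreTypes.keyValueStore()));

        // 1. Point lookup: fetches a single key directly, no full scan.
        String value = store.get("order-123");

        // 2. Range scan: keys between two bounds, in serialized-byte order.
        try (KeyValueIterator<String, String> range =
                 store.range("order-100", "order-200")) {
            while (range.hasNext()) {
                KeyValue<String, String> kv = range.next();
                // process kv.key / kv.value
            }
        }

        // 3. Prefix scan (Kafka Streams 2.8+): keys sharing a common prefix.
        try (KeyValueIterator<String, String> byPrefix =
                 store.prefixScan("order-", new StringSerializer())) {
            while (byPrefix.hasNext()) {
                KeyValue<String, String> kv = byPrefix.next();
            }
        }

        // 4. Full scan: every record in the store; at a billion records
        //    this is the one to avoid.
        try (KeyValueIterator<String, String> all = store.all()) {
            while (all.hasNext()) {
                KeyValue<String, String> kv = all.next();
            }
        }
    }
}
```

If the generated keys can be designed so that the search criterion becomes a common leading prefix, prefixScan() or range() can replace the full iteration that all() forces.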

P.S. At any point in time my state store will contain at least a billion records.

Can someone please suggest the best way, performance-wise, to retrieve data by key from a Kafka KeyValue state store? Any alternative to this approach is also appreciated.

Update: after eviction of data from the state store, I am not deleting it but wish to update it with a flag stating whether it has been evicted or not. That is only possible with read/write access to the state store, which in turn is only available from inside the topology, since interactive queries give read-only access to the store. This is where my knowledge of Kafka ends; please correct me if I am wrong.
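One way to get that read/write access is to attach the store to a custom processor inside the topology: interactive queries stay read-only, but a processor sees the writable KeyValueStore. A minimal sketch, again assuming the hypothetical "agg-store" with String values and crudely encoding the flag into the value string:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Marks a record in the store as evicted instead of deleting it.
// The store must be connected to this processor when building the
// topology (e.g. Topology#addProcessor + #connectProcessorAndStateStores).
public class EvictionFlagProcessor
        implements Processor<String, String, String, String> {

    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        store = context.getStateStore("agg-store");
    }

    @Override
    public void process(Record<String, String> record) {
        String current = store.get(record.key());
        if (current != null) {
            // Read/write access: rewrite the value with an eviction flag.
            // A real value type would carry a proper boolean field instead.
            store.put(record.key(), current + "|evicted=true");
        }
    }
}
```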

pranayd
  • It's a KV-store... `get` should be an O(1) operation, and not scan over the entire store. If you really want something more performant, then I feel like using a proper database server with indices would help – OneCricketeer Jun 23 '22 at 19:15

1 Answer


I think you should use Apache Spark Streaming for this:

  1. Read data from Kafka through Spark Streaming
  2. Perform the aggregations/transformations in Spark
  3. Push the sanitized data into the desired downstream topics

I am not sure if this can be done in Kafka itself. (A minimal sketch of the Spark approach follows.)
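If you go that route, here is a minimal sketch of those three steps with Spark Structured Streaming's Java API. The broker address, topic names, and checkpoint path are placeholders, and a plain count per key stands in for the real aggregation:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaSparkAggregation {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("kafka-aggregation")
            .getOrCreate();

        // 1. Read from Kafka; broker and topic names are placeholders.
        Dataset<Row> input = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "input-topic")
            .load();

        // 2. Aggregate: count records per key (stand-in for the real logic).
        Dataset<Row> counts = input
            .selectExpr("CAST(key AS STRING) AS key")
            .groupBy("key")
            .count();

        // 3. Write results back to Kafka; the sink needs string key/value
        //    columns and a checkpoint location for fault tolerance.
        StreamingQuery query = counts
            .selectExpr("key", "CAST(count AS STRING) AS value")
            .writeStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("topic", "output-topic")
            .option("checkpointLocation", "/tmp/agg-checkpoint")
            .outputMode("update")
            .start();

        query.awaitTermination();
    }
}
```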

Debug Logs