
I am working on a data integration project where we need to consume a Kafka stream of business events and produce daily and monthly reports. We need some sort of state store for the stream. The approaches we have brainstormed so far are:

- Use a KTable to store the events and let (one-to-many) consumers query it for further ETL processing, or
- Use a key-value store (such as DynamoDB) to dump the events and let consumers read from it.
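
To make the first option concrete, here is a minimal Kafka Streams sketch of the kind of KTable I have in mind. The topic names (`business-events`, `event-counts`), the String event payload, and the store name are placeholders, not our actual setup:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class BusinessEventAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "business-event-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Raw business events keyed by entity id; the value is the event payload (assumed String here).
        KStream<String, String> events = builder.stream(
                "business-events", Consumed.with(Serdes.String(), Serdes.String()));

        // Count events per key into a local, queryable state store backing the KTable.
        KTable<String, Long> eventCounts = events
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.as("event-counts-store"));

        // Publish the changelog so downstream ETL consumers can read the aggregated view.
        eventCounts.toStream().to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```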

We certainly don't want to own the events, and the storage should go away after reporting is done. I am a little concerned about the volume of data being stored for monthly processing, because when I looked at the Kafka topic, a week's worth of events is already in the range of GBs.
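
One variation I am considering to keep the state bounded is a daily windowed aggregation whose store retention just covers the monthly reporting horizon, so old window segments are dropped automatically once reports are done. Again, the topic name, window size, and 32-day retention below are placeholder assumptions:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.state.WindowStore;

public class DailyWindowedCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "daily-windowed-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Daily tumbling windows; retention is set slightly over the monthly horizon
        // so expired window segments are purged automatically after reporting.
        KTable<Windowed<String>, Long> dailyCounts = builder
                .stream("business-events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(1)))
                .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("daily-counts-store")
                        .withRetention(Duration.ofDays(32)));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```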

I am relatively new to this problem space, so I would appreciate help weighing efficiency and scalability, and choosing something that won't become an anti-pattern for future use cases.

Snehal S
  • I see nothing wrong with dumping data to Dynamo (or even S3), then use a more appropriate BI tool to aggregate and analyze. You're going to run into lag issues where data at the "end of the reporting window" might be missing, though. – OneCricketeer Apr 11 '21 at 15:26

0 Answers