I am trying to architect an event streaming system to replace our existing database table polling mechanism. Currently, Application ABC scans the entire XYZ (MySQL) table every 5 minutes to pick up any updates to our data and cache them in Application ABC. As our data grows, this approach will not scale or perform well.
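To make the current setup concrete, here is a rough sketch of what the polling does today (names and row shapes are hypothetical; `fetch_all_rows` stands in for the actual full-table `SELECT` against MySQL):

```python
def fetch_all_rows():
    """Stand-in for the full-table scan (in reality a SELECT * against the XYZ table)."""
    return [
        {"id": 1, "value": "a"},
        {"id": 2, "value": "b"},
    ]

def rebuild_cache():
    """Rebuild the entire in-memory cache from a full scan of the XYZ table."""
    return {row["id"]: row for row in fetch_all_rows()}

cache = rebuild_cache()

# In production this runs on a timer, re-scanning the whole table each cycle:
# while True:
#     time.sleep(300)  # every 5 minutes
#     cache = rebuild_cache()
```

The cost of `rebuild_cache` grows linearly with the table, which is why this won't scale.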
Instead, I want Application ABC to read from a Kafka stream containing change events for the XYZ table, and use those events to update Application ABC's in-memory cache.
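The consumption side I have in mind is roughly the following (a minimal sketch; the event shape is an assumption, and in reality the events would arrive via a Kafka consumer loop rather than direct function calls):

```python
def apply_event(cache, event):
    """Apply one XYZ change event to the in-memory cache.

    Assumed (hypothetical) event shape:
      {"op": "upsert" | "delete", "id": <key>, "row": {...}}
    """
    if event["op"] == "delete":
        cache.pop(event["id"], None)  # tombstone: drop the row if present
    else:
        cache[event["id"]] = event["row"]  # insert or overwrite the row

cache = {}
apply_event(cache, {"op": "upsert", "id": 1, "row": {"id": 1, "value": "a"}})
apply_event(cache, {"op": "upsert", "id": 1, "row": {"id": 1, "value": "a2"}})
apply_event(cache, {"op": "delete", "id": 1})
```

Because each event is keyed by row ID and the latest event wins, replaying the stream in order always converges the cache to the current table state.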
Where I'm having a hard time formulating a good solution is the initial load of the database table onto the Kafka stream. Since Application ABC holds the XYZ data only in memory, we lose it whenever we redeploy the Application ABC nodes. So we need some mechanism for a node to recover all of the XYZ data, including the initial load, from the stream. I know Kafka topics can be configured with infinite retention, but I'm not sure infinite retention is a realistic solution here due to cost.
What's the usual prescribed solution for this initial-load case, where Application ABC needs to reload the entire table off of the stream every time a new instance is spun up? I'm also trying to figure out the most performant approach, so that Application ABC can gather all the data it needs from the XYZ table with the lowest possible latency.
One more constraint worth mentioning: Application ABC needs to hold this data in memory for performance reasons, and we need to be able to iterate over the entire XYZ data set at all times. Simple lookups by ID are not enough.
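To illustrate the access pattern, here is a hypothetical example of the kind of whole-dataset iteration we need to support (the field names are made up for illustration, but the point is that no single-ID query can answer this):

```python
# A small stand-in for the full in-memory XYZ cache, keyed by row ID.
cache = {
    1: {"id": 1, "region": "us-east", "value": 10},
    2: {"id": 2, "region": "eu-west", "value": 20},
    3: {"id": 3, "region": "us-east", "value": 30},
}

def total_for_region(cache, region):
    """Aggregate over every cached row -- requires scanning the whole data set."""
    return sum(row["value"] for row in cache.values() if row["region"] == region)
```

Queries like `total_for_region(cache, "us-east")` touch every row, which is why the data has to be local and fully materialized rather than fetched by ID on demand.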