
I am trying to architect an event streaming system to replace our existing database table polling mechanism. We currently have a process where Application ABC queries/scans the entire XYZ (MySQL) table every 5 minutes to pick up any updates to our data and cache them in Application ABC. As our data grows, this will not be scalable or performant.

Instead, I want to have Application ABC read from a Kafka stream that contains any new events around the XYZ table, and use that to modify Application ABC's in-memory cache.

Where I'm having a hard time formulating a good solution is the initial database table load onto the Kafka stream. Since all the XYZ data that would be consumed by Application ABC is cached, we lose that data when we redeploy all of the Application ABC nodes. So we would need some kind of mechanism to be able to get all the XYZ data from the initial load onto the stream. I know Kafka streams are supposed to allow for infinite retention but I'm not sure if infinite retention is a realistic solution in this case due to cost.

What's the usually prescribed solution around this initial load case where Application ABC would need to reload the entire database again off of the stream (every time a new instance is spun up)? Also trying to think about what is the most performant solution here so that Application ABC has the lowest latency to be able to gather all the data it needs from XYZ Table.

Another constraint to mention is that Application ABC needs to have this data in memory for performance reasons. We need to be able to iterate over the entire XYZ data set at all times. We cannot do simple queries by ID.

Farhan Islam

1 Answer


There is a bit to unpack here, but here is some info.

Instead of polling the DB, consider using a source connector to get the data into Kafka. Debezium is made for this, and it supports MySQL along with quite a few other databases. The mechanism is called CDC - Change Data Capture. For MySQL, Debezium reads the server's binary log, so row-based binary logging (binlog_format=ROW) needs to be enabled on the database first.
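A sketch of what registering the MySQL connector with Kafka Connect could look like. The hostnames, credentials, and topic names below are placeholders, and property names vary slightly across Debezium versions (topic.prefix was database.server.name before Debezium 2.0). Note that snapshot.mode=initial makes Debezium take a full snapshot of the table before streaming changes, which also addresses the initial-load concern in the question:

```json
{
  "name": "xyz-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "***",
    "database.server.id": "184054",
    "topic.prefix": "xyz-db",
    "table.include.list": "mydb.XYZ",
    "snapshot.mode": "initial",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.xyz"
  }
}
```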

As for the Application ABC side - consider using a distributed cache with persistence enabled. Redis is a good option for this. That way the data is retained even if your application is restarted. Reloading all the data back from Kafka is not a good idea: depending on the amount of data it can take a long time, and the application will be unavailable for that duration after a restart.
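For example, Redis persistence can be enabled with the append-only file in redis.conf (a minimal sketch; tune the fsync policy to your durability needs):

```
# redis.conf: append-only file persistence, so data survives a restart
appendonly yes
# fsync once per second: a common durability/throughput trade-off
appendfsync everysec
```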

Joe M
  • A distributed cache isn't really an option since Application ABC needs to have this data in memory for performance reasons. We need to be able to iterate over the entire XYZ data set at all times. We cannot do simple queries by ID. – Farhan Islam Nov 23 '22 at 02:19
  • We potentially could have some kind of mechanism that reads off of a persistence cache on startup and load it into Application ABC's memory. – Farhan Islam Nov 23 '22 at 02:19
  • I'm not an expert on Redis, but I do know that it's able to serve requests from in-memory, while also persisting the data on disk for durability, which seems to fit in well with your performance requirement as well as your durability requirement. See: https://redis.io/docs/getting-started/faq/ – Joe M Nov 23 '22 at 05:07
  • Why Redis over Kafka Streams/RocksDB? – OneCricketeer Nov 23 '22 at 13:14
  • I thought Kafka Streams is more suited to real-time processing, but do you reckon it's suited for his requirement? If so I'm happy to stand corrected. – Joe M Nov 23 '22 at 22:07
  • Kafka Streams can build persistent KV stores, local to the app, via real time processing KV pairs in topics – OneCricketeer Dec 30 '22 at 03:27
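To make the compacted-topic idea from the comments concrete: if the change events are keyed by the table's primary key, a freshly started instance can rebuild its in-memory view by replaying the topic from the beginning. The replay semantics are simply last-write-wins per key, with a null value acting as a delete tombstone (this is how Kafka log compaction and Kafka Streams KTables treat records). A minimal sketch in plain Python, with a list standing in for the consumed records:

```python
def replay(events):
    """Rebuild an in-memory cache from keyed change events,
    as a compacted Kafka topic would deliver them."""
    cache = {}
    for key, value in events:
        if value is None:
            cache.pop(key, None)   # tombstone: the row was deleted
        else:
            cache[key] = value     # insert/update: last write wins
    return cache

log = [
    ("row1", {"name": "a"}),
    ("row2", {"name": "b"}),
    ("row1", {"name": "a2"}),  # later update overwrites earlier value
    ("row2", None),            # delete tombstone removes the row
]
print(replay(log))  # {'row1': {'name': 'a2'}}
```

After the replay, the resulting dict is fully in memory and can be iterated over freely, which matches the constraint that Application ABC must scan the whole XYZ data set rather than query by ID.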