
I have an architecture where I would like to query a ksqlDB table from a Kafka stream A (created by ksqlDB). On startup, Service A will load all the data from this table into a hashmap, and afterward it will start consuming from Kafka Stream A, acting on events to update the hashmap. I want to avoid any race condition in which I would miss events that were propagated to Kafka Stream A between the time I queried the table and the time I started consuming from Kafka Stream A. Is there a way to retrieve the latest offset that my query to the table is populated by, so that I can use that offset to start consuming from Kafka Stream A?


Another thing to mention is that we have hundreds of instances of our app going up and down, so reading directly off the Kafka stream is not an option: reading an entire stream's worth of data every time our apps come up is not a scalable solution. Reading the event stream's data into a hashmap on the service is a hard requirement. This is why the ksqlDB table seems like a good option, since we can get the latest state of the data in the format needed and then just update it based on events from the stream. Kafka Stream A is essentially a CDC stream off of a MySQL table that has been enriched with other data.

Matthias J. Sax
Farhan Islam
  • Don't use Hashmaps. What have you tried by creating a KTable local to that app rather than a ksqlDB one? – OneCricketeer Dec 30 '22 at 03:10
  • My business use case requires hashmaps. That's something that can't change in this implementation. What do you mean by creating a KTable local to that app? – Farhan Islam Dec 30 '22 at 03:17
  • Actually consume from Kafka. Use Kafka Streams API to create a KTable. Do not do remote lookups. Then you won't have inconsistency problems – OneCricketeer Dec 30 '22 at 03:24
  • If I go that route then I'd need to load in the entire stream's worth of data from the earliest offset on startup each time. Isn't that a bad implementation? – Farhan Islam Dec 30 '22 at 03:26
  • Not really. Confluent Schema Registry is exactly built that way. Depends on the size of your dataset, really. ksql also would need to read from the beginning if it were to crash – OneCricketeer Dec 30 '22 at 03:28
  • For my application to be ready I would have to wait for my application to consume every event since the earliest offset. If my data grows this doesn't seem scalable. I understand ksqldb would need to do the same if it crashed but that would happen rarely while my service is going to be deployed regularly. – Farhan Islam Dec 30 '22 at 06:09
  • It is scalable as long as you don't frequently restart your app, as you just said. The alternative is to always remotely query the view or use ksql push queries – OneCricketeer Dec 30 '22 at 17:32
  • So that is my case where my app will frequently restart. Do you know the answer to my original question of being able to retrieve the latest offset from a remote query? – Farhan Islam Dec 30 '22 at 18:51
  • Can you not use persistent storage for your app? As in, RocksDB instead of a Hashmap? And therefore can use **persistent** statestores that are _saved across restarts_, and **don't** need to consume from the beginning unless that storage is lost? For offsets, as the answer below says, use listOffsets method for the ksql app consumer group id – OneCricketeer Dec 30 '22 at 23:51
  • No, I can't use persistent storage for my app. I would like to but with my business use case, I need to iterate over every single record within a hashmap for processes that are happening almost every second. – Farhan Islam Dec 31 '22 at 00:20
  • RocksDB also supports full table scans – OneCricketeer Dec 31 '22 at 00:23
  • Just did some reading on RocksDB. It seems like it's an embedded database that would live on my service? If that were the case, would it really be persistent storage if my service were to go down and get redeployed? I would have to repopulate it just like I am with my hashmap, no? – Farhan Islam Dec 31 '22 at 00:44
  • What exactly prevents you from writing data to disk? – OneCricketeer Dec 31 '22 at 01:12
  • Performance. I haven't benchmarked RocksDB but can a call to a table scan of RocksDB match the same performance of iterating over an entire in-memory hashmap within the service? – Farhan Islam Dec 31 '22 at 01:16
  • Assuming it's on a SSD, then it'll be close, but obviously nothing will be as fast as RAM access. The tradeoff will be waiting to populate in memory storage upon boot – OneCricketeer Dec 31 '22 at 01:17

1 Answer


You used "materialized view" but I'm going to pretend I heard "table". I have often used materialized views in a historical reporting context, but not with live updates. I assume that yours behaves like a "table".

I assume that all events, and DB rows, have timestamps. Hopefully they are "mostly monotonic", so applying a small safety window lets us efficiently process just the relevant recent ones.


The crux of the matter is racing updates. We need to prohibit races.

Each time an instance of a writer, such as your app, comes up, assign it a new name. Rolling a GUID is often the most convenient way to do that; prepend a timestamp if sort order matters.

Ensure that each DB row mentions that "owning" name.
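For instance, a minimal sketch of such a sortable instance name (the millisecond-prefix scheme here is an illustration, not prescribed by the answer):

```python
import time
import uuid

def new_instance_name() -> str:
    """Timestamp-prefixed GUID so names from the same host sort by startup time.

    Zero-padding the millisecond prefix to a fixed width keeps the
    lexicographic order identical to the numeric order.
    """
    return f"{int(time.time() * 1000):013d}-{uuid.uuid4()}"
```

Two names rolled by the same process are then both unique and chronologically comparable as plain strings.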

"...want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the materialized view, and when I started consuming off Kafka Stream A."

We will need a guaranteed monotonic column with an integer ID or a timestamp. Let's call it ts.

  1. Query m = max(ts).
  2. Do a big query of records < m, slowly filling your hashmap.
  3. Start consuming Stream A.
  4. Do a small query of records >= m, updating the hashmap.
  5. Continue to loop through subsequently arriving Stream A records.

Now you're caught up, and can maintain the hashmap in sync with DB.
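The five-step catch-up sequence can be sketched with the table query and the stream consumer stubbed out as plain lists of (ts, key, value) tuples; all names are illustrative, and real code would replace the lists with a ksqlDB pull query and a Kafka consumer:

```python
def catch_up(table_rows, stream_events):
    """Simulate steps 1-5: snapshot the table, then replay the stream."""
    state = {}
    # 1. m = max(ts) over the table snapshot
    m = max(ts for ts, _, _ in table_rows)
    # 2. big query: rows strictly below m, slowly filling the hashmap
    for ts, key, value in table_rows:
        if ts < m:
            state[key] = value
    # 3. "start consuming" -- in real life, subscribe to Stream A here
    # 4. small query: rows at or after m (the deliberately left-over records)
    for ts, key, value in table_rows:
        if ts >= m:
            state[key] = value
    # 5. apply subsequently arriving stream events, newest value wins
    for ts, key, value in stream_events:
        state[key] = value
    return state
```

Note that step 2's strict `< m` guarantees step 4 always has at least one record to re-apply, which is the point the answer makes about keeping a "left over" record.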

Your business logic probably requires that you treat DB rows mentioning the "self" guid in a different way from rows that existed prior to startup. Think of it as de-dup, or ignoring replayed rows.
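A minimal sketch of that de-dup, assuming each row carries an "owner" field holding the writing instance's name (the field names are illustrative):

```python
def apply_row(state, row, self_name):
    """Apply a replayed row to the hashmap, ignoring our own writes.

    Rows stamped with this instance's own name are replays of writes
    we already made, so they are skipped rather than re-applied.
    """
    if row["owner"] == self_name:
        return  # our own write echoed back to us
    state[row["key"]] = row["value"]
```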


You may find offsetsForTimes() useful.

There's also listOffsets().
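A sketch of the offsetsForTimes idea, using the confluent-kafka Python client's equivalent, offsets_for_times (listOffsets similarly has an admin-side analogue, AdminClient.list_offsets). Topic and partition values are placeholders, and the client import is deferred so the helper reads as a sketch without the dependency installed:

```python
def timestamp_query(topic, partitions, ts_ms):
    """Pure helper: pair each partition with the timestamp to look up."""
    return [(topic, p, ts_ms) for p in partitions]

def seek_to_timestamp(consumer, topic, partitions, ts_ms):
    """Seek each partition to the first offset whose record timestamp is >= ts_ms."""
    from confluent_kafka import TopicPartition  # assumes confluent-kafka is installed
    # offsets_for_times takes TopicPartitions whose .offset field carries the timestamp
    query = [TopicPartition(t, p, ts) for t, p, ts in timestamp_query(topic, partitions, ts_ms)]
    for tp in consumer.offsets_for_times(query):
        if tp.offset >= 0:  # -1 means no record at/after ts_ms in that partition
            consumer.seek(tp)
```

With the catch-up scheme above, ts_ms would be the m you captured before the big query, minus whatever safety window you trust.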

J_H
  • Yes, that's my mistake. I conflated materialized view and ksqlDB table. I updated the question with the correct term. Thank you for this answer. This is what I was looking for! – Farhan Islam Dec 30 '22 at 19:42
  • It seems like ksqlDB offers ROWTIME: https://docs.ksqldb.io/en/latest/how-to-guides/use-a-custom-timestamp-column/#using-event-time . I think I will just leverage this field instead. So on startup I will do a query for all records in the table for ROWTIME <= NOW. I will save that NOW variable, then do a ksqlDB PUSH query for ROWTIME >= NOW. This seems like the most straightforward solution. – Farhan Islam Dec 30 '22 at 20:42
  • Sure, that's fine, go for it. Though I'm slightly sad that incoming events don't have a comparable timestamp. I prefer to read stamps from just the database, and not from my wristwatch or other source that is "mostly" in sync. Also, I was quite deliberate about `< m`, so there is guaranteed to always be at least one "left over" record to attend to. This makes it less racy, and more likely that unit / integration tests verify behavior we care about. For example, I do not assume that all timestamps are distinct. When current time is "noon" and I read all `<= noon`, _another_ noon row may arrive. – J_H Dec 30 '22 at 21:09
  • ROWTIME seems like a system stamp though? From the docs: "ROWTIME is a system-column that ksqlDB reserves to track the timestamp of the event." Also, yes sorry I meant to write ROWTIME < NOW and ROWTIME >= NOW for my 2 queries. – Farhan Islam Dec 30 '22 at 23:51
  • Yeah, I get where you're coming from. But I hate races; call me paranoid. Two lessons: (1.) The man with two watches never knows what time it is. That is, pay attention to a single timebase. (2.) Transactional boundaries matter, and "read consistency" is a thing. So I am conservatively encouraging you to rely on committed row data, rather than on a wristwatch observation that was taken _outside_ a BEGIN / COMMIT transaction. – J_H Dec 30 '22 at 23:57
  • I guess I should throw in the obligatory [Lamport clock](https://en.wikipedia.org/wiki/Lamport_timestamp) reference about _happened-before_ causality. It is (host, timestamp) tuples, where each host can be relied on to enforce monotonicity of stamps. Comparing tuples with matching host is valid, but comparing host1's stamp against host2's stamp is not. So if e.g. we could SELECT (guid, timestamp) from the database, I would view that as relatively ideal. We can't, so I'll take what I can get. Some [protocols](https://en.wikipedia.org/wiki/Network_Time_Protocol) similarly separate host clocks. – J_H Dec 31 '22 at 00:05