
I'm planning to test how to make this kind of architecture work:

http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/

All the data is stored as facts in a log, but validations of a posted change must be made against a table. For example, if I send "Create Invoice with Customer 1", I need to validate that the customer exists (and other things); then, when the validation passes, commit the change to the log and apply it to the table, so the table has the most up-to-date information while I still keep the full history of changes.
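Roughly, the flow I have in mind would look like this (a sketch in Python with psycopg2; the table and column names are just placeholders):

    # Sketch of the intended flow: validate against the current-state table,
    # then append the fact to the log and apply it to the table in one transaction.
    # Table/column names (customers, event_log, invoices) are placeholders.
    import json
    import psycopg2

    def create_invoice(conn, customer_id, amount):
        with conn:                       # commits on success, rolls back on error
            with conn.cursor() as cur:
                # Validation: the customer must already exist.
                cur.execute("SELECT 1 FROM customers WHERE id = %s", (customer_id,))
                if cur.fetchone() is None:
                    raise ValueError("customer %s does not exist" % customer_id)

                # Append the fact to the log...
                event = {"type": "InvoiceCreated",
                         "customer_id": customer_id,
                         "amount": amount}
                cur.execute(
                    "INSERT INTO event_log (payload) VALUES (%s) RETURNING id",
                    (json.dumps(event),))
                event_id = cur.fetchone()[0]

                # ...and apply it to the current-state table in the same transaction.
                cur.execute(
                    "INSERT INTO invoices (customer_id, amount, source_event)"
                    " VALUES (%s, %s, %s)",
                    (customer_id, amount, event_id))
        return event_id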

I could put the log into a table in the database (I use PostgreSQL). However, I'm concerned about the scalability of doing that; also, I want to subscribe to the event stream from multiple clients, and neither PG nor any other RDBMS I know lets me do this without polling.

But if I use Kafka, I worry about ACID consistency between the two stores: Kafka could end up with wrong data that PG rolled back, or something similar.

So:

1- Is it possible to keep consistency between an RDBMS and a log store? OR 2- Is it possible to subscribe in real time, and to tune PG (or another RDBMS) for fast event storage?

mamcx
  • It is not clear what you want to achieve with such a setup vs. just using a db. Is the change log the only thing you want to get out of it? – Tim Sep 04 '16 at 01:34
  • And the ability to subscribe to it from several clients. I worry it could put a lot of pressure on the DB because I would need to use polling. – mamcx Sep 05 '16 at 17:10

1 Answer


Easy(1) answers to the questions as asked:

  1. Setting your transaction isolation level properly may be enough to achieve consistency and to stop worrying about DB rollbacks. You can still occasionally create an inconsistency unless you set the isolation level to 'serializable'. Even then you're guaranteed to be consistent, but you could still see undesirable behavior. For example, a client creates a customer and posts an invoice in rapid succession using an async API, and the invoice event hits your backend system first. In that case the invoice event is invalidated and the client has to retry, hoping the customer was created by then. This is easy to avoid if you control the clients and mandate that they use a sync API.

  2. Whether it is possible to store events in a relational DB depends on your anticipated dataset size, hardware and access patterns. I'm a big-time Postgres fan and there is a lot you can do to make event lookups blazingly fast. My rule of thumb: if your operating table size is below 200-300 GB and you have a decent server, Postgres is the way to go. With event sourcing there are typically no joins, and a common access pattern is to get all events by id (optionally restricted by timestamp). Postgres excels at this kind of query, provided you index smartly. However, event subscribers will need to poll for this data, so it may not be a good fit if you have thousands of subscribers, which is rarely the case in practice. (A minimal sketch of both points follows this list.)
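To make this concrete, here is a sketch of such an event table, a connection forced to 'serializable', and the lookup a subscriber would poll (psycopg2; the table, index and connection details are illustrative, not prescriptive):

    # Illustrative only: an append-only event table, an index for the common
    # lookup (all events for an id, optionally bounded by timestamp), and a
    # connection set to the 'serializable' isolation level for writers.
    import psycopg2
    from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

    DDL = """
    CREATE TABLE IF NOT EXISTS events (
        id           bigserial   PRIMARY KEY,
        aggregate_id text        NOT NULL,
        created_at   timestamptz NOT NULL DEFAULT now(),
        payload      jsonb       NOT NULL
    );
    CREATE INDEX IF NOT EXISTS events_aggregate_idx
        ON events (aggregate_id, created_at);
    """

    conn = psycopg2.connect("dbname=app")   # connection string is a placeholder
    conn.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

    with conn, conn.cursor() as cur:
        cur.execute(DDL)

    # The kind of lookup a subscriber (or read model) would poll:
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, created_at, payload FROM events"
            " WHERE aggregate_id = %s AND created_at >= %s"
            " ORDER BY created_at",
            ("customer-1", "2016-09-01"))
        for row in cur.fetchall():
            print(row)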

"Conceptually correct" answer: If you still want to pursue streaming approach and fundamentally resolve race conditions then you have to provide event ordering guarantees across all events in the system. For example, you need to be able to order 'add customer 1' event and 'create invoice for customer 1' event so that you can guarantee consistency at any time. This is a really hard problem to solve in general for a distributed system (see e.g. vector clocks). You can mitigate it with some clever tricks that would work for your particular case, e.g. in the example above you can partition your events by 'customerId' early as they hit backend, then you can have a guarantee that all event related to the same customer will be processed (roughly) in order they were created.

Would be happy to clarify my points if needed.

(1) Easy vs simple: mandatory link

Tim
  • 1) Is there a list of resources or a book where those "tricks" can be found? 2) I think the data will not be that big; my potential customers are small shop owners. I think that instead of writing the log to Kafka directly, I write the log to a table in the database, then pull the log INTO Kafka eventually (so only 1 client against the DB) and then use it to distribute the data to the subscribers. So I have DB -> LogInDb -> Pull -> LogInKafka -> PUSH -> Clients. – mamcx Sep 09 '16 at 00:02
  • Not that I'm aware of. It depends on what trade-offs your application can tolerate. I imagine it'll be hard to generalize. – Tim Sep 09 '16 at 00:04
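A rough sketch of the single "pull" step in that pipeline, i.e. one poller tailing the log table and forwarding new rows to Kafka (psycopg2 plus kafka-python; the table, topic and offset handling are assumptions):

    # Sketch of the DB -> LogInDb -> Pull -> LogInKafka -> PUSH idea from the
    # comment above: one process tails the log table and forwards rows to Kafka,
    # so only this process polls the database. All names are placeholders.
    import json
    import time

    import psycopg2
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    conn = psycopg2.connect("dbname=app")

    last_id = 0  # in practice, persist this offset somewhere durable

    while True:
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload FROM event_log WHERE id > %s ORDER BY id",
                (last_id,))
            for event_id, payload in cur.fetchall():
                producer.send("event-log", json.dumps(payload).encode("utf-8"))
                last_id = event_id
        producer.flush()
        time.sleep(1)  # polling interval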