
One of the great promises of Event Sourcing is the ability to replay events. When there's no relationship between entities (e.g. blob storage, user profiles) it works great, but how do we replay quickly when there are important relationships to check?

For example: Product(id, name, quantity) and Order(id, list of productIds). If we have a CreateProduct event followed by a CreateOrder event, the order succeeds (the product is available in the warehouse). This is easy to implement, e.g. with Kafka (one topic with n1 partitions for products, another with n2 partitions for orders).
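A minimal sketch of that happy path (the topic names, string payloads, and configuration are my illustration, not anything prescribed above):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object HappyPath extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  // Live traffic: the product exists before any order references it.
  // Keying by entity id keeps each entity's events in one partition.
  producer.send(new ProducerRecord("products", "p-1",
    """{"type":"CreateProduct","id":"p-1","name":"widget","quantity":1}"""))
  producer.send(new ProducerRecord("orders", "o-1",
    """{"type":"CreateOrder","id":"o-1","productIds":["p-1"]}"""))
  producer.close()
}
```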

During replay everything happens much faster, and Kafka may reorder the events across topics (e.g. deliver CreateOrder before CreateProduct), which gives us different behavior than the original run (CreateOrder now fails because the product doesn't exist yet). This is because Kafka guarantees ordering only within a single partition of a single topic. The easy solution would be to put everything into one huge topic with a single partition, but that would be completely unscalable: single-threaded replay of a bigger database could take days at least.

Is there any existing, better solution for quickly replaying related entities? Or should we forget about event sourcing and event replay whenever we need to check relationships in our databases, and accept that replay is only good for unrelated data?

iirekm
  • You need to build the replay logic to validate that there is a relationship to form, just like you probably needed to when the original events were created (the original data makes no guarantee on time or event rate either, as that is only a perception). Kafka itself isn't reordering events, since the topics are immutable – OneCricketeer Aug 25 '21 at 14:17
  • When you have >1 partition per topic, Kafka can (and typically will) reorder event delivery; only the order within one partition is guaranteed. Everyone usually has >1 partition per topic to improve scalability. – iirekm Aug 25 '21 at 14:27
  • That is the consumer client, not Kafka doing that, and no, offset order within single partitions is not changed when polling – OneCricketeer Aug 25 '21 at 14:31
  • It is correct to say that from the consumer's perspective the order in which messages in different Kafka partitions are received is not defined. – Levi Ramsey Aug 25 '21 at 16:17
  • Using Kafka for Event Sourcing is problematic. See: https://serialized.io/blog/apache-kafka-is-not-for-event-sourcing/ – CraigTP Aug 27 '21 at 08:25
  • Their problem is different: they want to make the event store almost a SQL database (querying events, optimistic locking on events, etc.). For most event sourcing applications that's not needed. Even if you stored messages in an (ACID-compliant) SQL database, once you wanted to scale it (process in multiple threads or on multiple machines), you would get the same problem. – iirekm Aug 27 '21 at 18:02
  • @iirekm that's where something like Akka (specifically Cluster Sharding and Persistence) comes in. – Levi Ramsey Aug 27 '21 at 19:25

3 Answers

0

As a practical necessity when event sourcing, you need the ability to conjure up the stream of events for a particular entity so that you can apply your event handler to build up its state. With Kafka, outside of the case where you have so few entities that you can assign an entire topic partition to the events of a single entity, this entails a linear scan and filter through a partition. For this reason, while Kafka is very likely to be a critical part of any event-driven/event-based system for relaying events published by a service for consumption by other services (at which point, if we consider the event vs. command dichotomy, we're talking about commands from the perspective of the consuming service), it's not well suited to the role of an event store, which is defined by its ability to quickly give you an ordered stream of the events for a particular entity.
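To make the "linear scan and filter" point concrete, here is a rough sketch of replaying one entity from a Kafka partition (the topic name, entity key, and the single poll are simplifying assumptions; a real replay would poll in a loop up to the end offset):

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object ReplayOneEntity extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "replay-o-1")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)

  // The partition that key "o-1" hashes to; we must scan ALL of it.
  val partition = new TopicPartition("orders", 0)
  consumer.assign(java.util.List.of(partition))
  consumer.seekToBeginning(java.util.List.of(partition))

  var state = List.empty[String]
  for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
    if (record.key == "o-1")          // filter: most records belong to other entities
      state = record.value :: state   // apply your real event handler here
  consumer.close()
}
```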

The most popular purpose-built event store is probably the imaginatively named Event Store (at least partly due to the involvement of a few prominent advocates of event sourcing in its design and implementation). Alternatively, there are libraries/frameworks like Akka Persistence (JVM, with a .NET port) which use existing DBs (e.g. relational SQL DBs, Cassandra, Mongo, Azure Cosmos, etc.) in a way that facilitates their use as an event store.

Event sourcing also, as a practical necessity, tends to lead to CQRS (they go together very well: event sourcing is arguably the simplest possible persistence model capable of being a write model, while it's nearly useless as a read model). The typical pattern is that the command-processing component of the system enforces constraints like "product exists before being added to the cart" before writing events to the event store (how those constraints are enforced is generally a question of whatever concurrency model is in use: the actor model has a high level of mechanical sympathy with this approach, but other models are possible). Events read back from the event store can then be assumed to have been valid as of the time they were written (it's possible to later decide a compensating event needs to be recorded). The events from within the event store can be projected to a Kafka topic for communication to another service (the command-processing component is the single source of truth for events).
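A minimal sketch of that "validate the command, then persist the event" pattern, using Akka Persistence Typed (the entity, command, and event names are my own illustration, not from the answer):

```scala
import akka.actor.typed.Behavior
import akka.persistence.typed.PersistenceId
import akka.persistence.typed.scaladsl.{Effect, EventSourcedBehavior}

object ProductEntity {
  // Commands may be rejected; events are facts that were already validated.
  sealed trait Command
  final case class Create(name: String, quantity: Int) extends Command
  final case class Reserve(n: Int) extends Command

  sealed trait Event
  final case class Created(name: String, quantity: Int) extends Event
  final case class Reserved(n: Int) extends Event

  final case class State(exists: Boolean, quantity: Int)

  def apply(id: String): Behavior[Command] =
    EventSourcedBehavior[Command, Event, State](
      persistenceId = PersistenceId.ofUniqueId(s"product-$id"),
      emptyState = State(exists = false, quantity = 0),
      commandHandler = (state, cmd) => cmd match {
        case Create(name, q) if !state.exists =>
          Effect.persist(Created(name, q))
        case Reserve(n) if state.exists && state.quantity >= n =>
          Effect.persist(Reserved(n))   // constraint checked BEFORE the write
        case _ =>
          Effect.none                   // invalid command: no event is stored
      },
      eventHandler = (state, evt) => evt match {
        case Created(_, q) => State(exists = true, quantity = q)
        case Reserved(n)   => state.copy(quantity = state.quantity - n)
      }
    )
}
```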

From the perspective of that other service, as noted, the projected events in the topic are commands (the implicit command for an event is "update your model to account for this event"). Semantically, their provenance as events means that they've been validated and are undeniable (they can, however, be ignored). If there's some model validation that needs to occur, that generally entails either a conscious decision to ignore the command or waiting until another command is received which allows this one to be accepted.

Levi Ramsey
  • I thought about waiting. Technically it's possible in Kafka (`Consumer.seek`, `Consumer.commit`). Unfortunately it doesn't solve the problem entirely: suppose that user A calls `CreateProduct`, and then user B and user C call `CreateOrder` at the same time. There's only one piece of the product in the warehouse. The `CreateOrder` which is delivered first wins. Because this is a huge store, we would like to use multiple Kafka partitions. Because of the undefined order of messages from different partitions, during replay the order between users B and C can be reversed. – iirekm Aug 26 '21 at 06:30
  • You need to not consider `CreateProduct` or `CreateOrder` as an event, but as a command, meaning that they can be invalid. – Levi Ramsey Aug 26 '21 at 10:48
  • No matter whether you allow only valid 'events' or potentially invalid 'commands', the problem stays the same: if you try to scale by adding more partitions, you get a mess. – iirekm Aug 27 '21 at 18:04
0

OK, you are still thinking about how we developed applications over the last 20 years instead of how we should develop them in the future. There are frameworks that actually fit the paradigms of the future; one of those, mentioned above, is Akka, and more importantly one of its sub-components, Akka FSM (Finite State Machine). State machines are a concept we have ignored in software development for years, but the future seems to be more and more event-based, and we can't ignore them anymore.

So how will these help you? Akka is a framework based on the Actor concept: every Actor is a unique entity with a mailbox. Say you have an Order Actor with id 123456789; every event for Order id 123456789 will be processed by this Actor, and its messages are ordered in its mailbox on a first-in, first-out principle, so you don't need any synchronisation logic anymore. But you could have millions of Order Actors in your system, so they can work in parallel: while Order Actor 123456789 is processing its events, Order Actor 987654321 can process its own. There is your parallelism and scalability. As long as Kafka guarantees the order of every message for key 123456789 and for key 987654321, everything is green.

Now you may ask where the Finite State Machine comes into play. As you mentioned, the problem arises when an addProduct event arrives before the createOrder event (while they are on different Kafka topics). At that point the state machine behaves differently depending on whether the Order Actor is in the CREATED state or the INITIALISING state: in CREATED it will just add the product; in INITIALISING it will probably just stash the event until createOrder arrives.
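A hedged sketch of that two-state behaviour, here using Akka Typed's stash rather than classic Akka FSM (the state and message names are assumptions taken from the description above):

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

object OrderActor {
  sealed trait Message
  case object CreateOrder extends Message
  final case class AddProduct(productId: String) extends Message

  // INITIALISING: AddProduct arrived "too early", so stash it until
  // CreateOrder shows up, then replay everything in mailbox order.
  def initialising: Behavior[Message] =
    Behaviors.withStash[Message](capacity = 100) { buffer =>
      Behaviors.receiveMessage {
        case CreateOrder => buffer.unstashAll(created(Nil))
        case other =>
          buffer.stash(other)
          Behaviors.same
      }
    }

  // CREATED: products can simply be added; the mailbox gives FIFO
  // ordering per order id, with millions of actors running in parallel.
  private def created(products: List[String]): Behavior[Message] =
    Behaviors.receiveMessage {
      case AddProduct(id) => created(id :: products)
      case CreateOrder    => Behaviors.same // duplicate create: ignore
    }
}
```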

These concepts are explained really well in this video, and if you want to see a practical example, I have a blog post for it, and this one for a more direct dive.

posthumecaver
-1

I think I found the solution for scalable (multi-partition) event sourcing:

  • create in Kafka (or in a similar system) a topic named messages
  • assign users to partitions (e.g. by murmurHash(login) % partitionCount, as in the sketch after this answer)
  • if a piece of data is mutable (e.g. Product, Order), every partition should contain its own copy of the data
  • if we have e.g. 256 pieces of a product in our warehouse and 64 partitions, we can initially 'give' every partition 4 pieces, so most CreateOrder events will be processed quickly without leaving the user's partition
  • if a user (a partition) sometimes needs to mutate data in another partition, it should send a message there:
    • for example, for the Product/Order domain, partitions could work like Walmart/Tesco stores around a country, and the messages sent between partitions ('stores') could be things like CreateProduct, UpdateProduct, CreateOrder, SendProductToMyPartition, ProductSentToYourPartition
    • the message then becomes an 'event', as if it had been generated by a user
    • the message shouldn't be re-sent during replay (it was already sent; no need to do it twice)

This way, even when Kafka (or any other event-sourcing system) chooses to reorder messages between partitions, we'll still be OK, because we never read any data outside our single-threaded 'island'.
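A sketch of the routing piece of this idea (`Utils.murmur2` is the same hash Kafka's default partitioner applies to record keys, so the assignment stays stable across replays; the topic name, payload, and stock split are illustrative):

```scala
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.utils.Utils

object PartitionRouting {
  val partitionCount = 64

  // murmurHash(login) % partitionCount, using Kafka's own murmur2.
  def partitionFor(login: String): Int =
    Utils.toPositive(Utils.murmur2(login.getBytes("UTF-8"))) % partitionCount

  // 256 pieces over 64 partitions: each 'virtual warehouse' starts with 4.
  val initialStockPerPartition: Int = 256 / partitionCount

  // Route a user's event explicitly to 'their' partition of the single topic.
  val record = new ProducerRecord[String, String](
    "messages", partitionFor("alice"), "alice",
    """{"type":"CreateOrder","productIds":["p-1"]}""")
}
```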

EDIT: As @LeviRamsey noted, this 'single-threaded island' is basically the actor model, and frameworks like Akka can make it a bit easier.

iirekm
  • So every change to every product and every order gets published to every partition... how exactly is this different from a single partition? – Levi Ramsey Aug 27 '21 at 19:22
  • The idea of aiming for a single-threaded island of consistency is exceptionally powerful though... you may want to read up on the actor model (e.g. Akka on the JVM). – Levi Ramsey Aug 27 '21 at 19:24
  • Changes like a product name should be propagated to every partition, but things like orders or the remaining product counts in partitions' 'virtual warehouses' can be partition-private, so in most cases `CreateOrder` can be processed in one partition. Akka looks great, but the question is: how fast is it compared to Kafka? According to benchmarks, if we don't want to store messages, persistence-less queues like ZeroMQ are fastest, but if we want to store messages (e.g. event sourcing), Kafka usually wins. Another issue may be adoption: many devs today know Kafka, which can't be said of Akka. – iirekm Aug 28 '21 at 05:34
  • Since with a Kafka solution you're likely either doing a linear scan through a partition and ignoring most of the messages, or, if you have N entities using a partition, processing all N in the same thread one message at a time, I'd feel comfortable saying that an Akka solution (using Cassandra as the persistence backend) will, especially if we're talking about (at least) thousands of stores, hundreds of thousands of products, and millions of orders, beat your proposed Kafka solution by 10x, maybe 100x per core. – Levi Ramsey Aug 28 '21 at 15:14