
We are building an application using the LMAX Disruptor. When using Event Sourcing, you often want to persist periodic snapshots of your domain model (some people call this the Memory Image pattern).

I need a better solution than what we are currently using to serialize our domain model when taking a snapshot. I want to be able to "pretty-print" this snapshot in a readable format for debugging, and I want to simplify snapshot schema migration.

Currently, we are using Google's Protocol Buffers to serialize our domain model to a file. We chose this solution because Protocol Buffers are more compact than XML or JSON, and using a compact binary format seemed like a good idea for serializing a big Java domain model.

The problem is that Protocol Buffers were designed for relatively small messages, and our domain model is quite big. So the domain model does not fit in one big hierarchical protobuf message, and we end up serializing a series of protobuf messages to a file, like this:

for each account {
    write simple account fields (id, name, description) as one protobuf message
    write number of user groups
    for each user group {
        convert user group to protobuf message, and serialize it
    }
    for each user {
        convert user to protobuf message, and serialize it
    }
    for each sensor {
        convert sensor to protobuf message, and serialize it
    }
    ...
}
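
Concretely, the writing side looks roughly like this. This is a minimal sketch: Account, UserGroup and the toAccountMessage(...) / toUserGroupMessage(...) mappers are placeholders for our actual domain classes, generated protobuf classes, and mapping code; only the CodedOutputStream calls are standard protobuf API:

import com.google.protobuf.CodedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class SnapshotWriter {

    void writeSnapshot(Iterable<Account> accounts, OutputStream out) throws IOException {
        CodedOutputStream coded = CodedOutputStream.newInstance(out);
        for (Account account : accounts) {
            // Simple account fields as one size-delimited protobuf message.
            coded.writeMessageNoTag(toAccountMessage(account));

            // The stream itself has no structure, so the number of user
            // groups has to be written explicitly before the messages.
            coded.writeUInt32NoTag(account.getUserGroups().size());
            for (UserGroup group : account.getUserGroups()) {
                coded.writeMessageNoTag(toUserGroupMessage(group));
            }
            // ... same pattern for users, sensors, etc.
        }
        coded.flush();
    }
}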

This is annoying, because manipulating a stream of heterogeneous protobuf messages is complicated. It would be a lot easier if we had one big protobuf message that contained all of our domain model, like this:

public class AggregateRoot {
    List<Account> accounts;
}

--> convert to big hierarchical protobuf message using some mapping code:

message AggregateRootMessage {
    repeated AccountMessage accounts = 1;
}

--> persist this big message to a file
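
With that layout, persisting a snapshot collapses to a few lines. A minimal sketch, assuming AggregateRootMessage is the class protoc generates from the schema above and buildAggregateRootMessage(...) stands for the mapping code:

import java.io.FileOutputStream;
import java.io.IOException;

void persistSnapshot(AggregateRoot root, String path) throws IOException {
    AggregateRootMessage snapshot = buildAggregateRootMessage(root);
    try (FileOutputStream out = new FileOutputStream(path)) {
        snapshot.writeTo(out); // one call serializes the whole tree
    }
}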

If we do this, pretty-printing a snapshot is easy: simply read the big protobuf message, then pretty-print it using protobuf's TextFormat. With our current approach, we need to read the various protobuf messages one by one and pretty-print them, which is harder: the order of the protobuf messages in the stream depends on the current snapshot schema, so our pretty-printing tool needs to be aware of that.
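
The whole pretty-printing tool would then reduce to something like this (same assumed AggregateRootMessage class; TextFormat.printToString(...) is standard protobuf API):

import com.google.protobuf.TextFormat;
import java.io.FileInputStream;
import java.io.IOException;

void prettyPrintSnapshot(String path) throws IOException {
    try (FileInputStream in = new FileInputStream(path)) {
        AggregateRootMessage snapshot = AggregateRootMessage.parseFrom(in);
        System.out.println(TextFormat.printToString(snapshot));
    }
}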

I also need a tool to migrate snapshots to the new snapshot schema when our domain model evolves. I'm still working on this tool, but it's hard, because I have to deal with a stream of various protobuf messages instead of just one big message. If it were just one big message, I could do the following (sketched below):

  • take the snapshot file
  • parse the file as a big Java protobuf message, using the .proto schema for the previous snapshot version
  • convert this big protobuf message into a big protobuf message for the new version, using Dozer and some mapping code
  • write this new protobuf message to a new file, using the .proto schema for the new version
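
As a sketch, assuming AggregateRootMessageV1 and AggregateRootMessageV2 are the classes generated from the old and new .proto schemas, and migrate(...) stands for the Dozer / hand-written mapping step:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

void migrateSnapshot(String oldFile, String newFile) throws IOException {
    AggregateRootMessageV1 old;
    try (FileInputStream in = new FileInputStream(oldFile)) {
        old = AggregateRootMessageV1.parseFrom(in); // old schema
    }
    AggregateRootMessageV2 migrated = migrate(old); // Dozer + mapping code
    try (FileOutputStream out = new FileOutputStream(newFile)) {
        migrated.writeTo(out); // new schema
    }
}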

But since I am dealing with a stream of protobuf messages of various types, my tool needs to handle this stream in the correct order.

So, yeah... I guess my questions are:

  • Do you know any serialization tool that can serialize a big domain model into a file, without protobuf's limitations, possibly using streaming to avoid OutOfMemoryErrors?

  • If you use event sourcing or memory images, what do you use to serialize your domain model? JSON? XML? Protobuf? Something else?

  • Are we doing it wrong? Do you have any suggestions?

Etienne Neveu

3 Answers


The way I would define a solution to this problem is by separating the 'specification' from the 'transfer syntax'. Once the message specifications are defined, you can work on the wire-line representation, which may need to support different trade-offs between machine efficiency and human readability, say:

  • binary mode - least verbose, but not human readable
  • character mode - commands and parameters represented as text; more readable, and also provides robust storage
  • clear text - say, for debugging purposes

The solution must provide switchable behavior. You can base it on ASN.1 and its related tool-set, which is both language and platform agnostic; a rich ecosystem is available for Java (Bouncy Castle et al.). We have used it with fairly large message blobs over the network with no known issues :)
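
To make this concrete, here is a minimal sketch using Bouncy Castle's ASN.1 classes; the account fields are placeholders. The same in-memory structure can be emitted as compact DER bytes, or dumped as clear text for debugging:

import java.io.IOException;
import org.bouncycastle.asn1.ASN1Encodable;
import org.bouncycastle.asn1.ASN1Integer;
import org.bouncycastle.asn1.DERSequence;
import org.bouncycastle.asn1.DERUTF8String;
import org.bouncycastle.asn1.util.ASN1Dump;

public class Asn1Demo {
    public static void main(String[] args) throws IOException {
        // One "account" as an ASN.1 SEQUENCE (placeholder fields).
        DERSequence account = new DERSequence(new ASN1Encodable[] {
                new ASN1Integer(42),               // id
                new DERUTF8String("test account")  // name
        });

        byte[] der = account.getEncoded();            // compact binary wire format
        System.out.println(der.length + " bytes of DER");
        System.out.println(ASN1Dump.dumpAsString(account, true)); // human-readable dump
    }
}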

Hope it gives some pointers.

Nitin Tripathi
  • 491
  • 2
  • 10
  • My ideal answer would have been from someone who encountered a similar problem (snapshots in event sourcing) and could describe his solution. But the bounty is expiring in 15 minutes, and your answer, while theoretical, gave me the most food for thought, so I'll give you the bounty :) I'll wait for other answers before "accepting an answer", though. – Etienne Neveu Jun 11 '13 at 09:34

Just off the top of my head (without actually knowing how big your snapshot files would get):

Have you tried Google's Gson JSON library? It seems to provide both versioning (https://sites.google.com/site/gson/gson-user-guide#TOC-Versioning-Support) and streaming (https://sites.google.com/site/gson/streaming) for JSON-based documents.
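
For reference, the versioning support boils down to an annotation plus a GsonBuilder setting. A minimal sketch, with a stand-in Account class:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.annotations.Since;

public class GsonVersioningDemo {
    static class Account {
        long id = 42;
        String name = "test";
        @Since(2.0) String description = "added in schema version 2";
    }

    public static void main(String[] args) {
        // Fields annotated with a @Since value above the configured
        // version are simply skipped during (de)serialization.
        Gson v1 = new GsonBuilder().setVersion(1.0).create();
        Gson v2 = new GsonBuilder().setVersion(2.0).create();
        System.out.println(v1.toJson(new Account())); // {"id":42,"name":"test"}
        System.out.println(v2.toJson(new Account())); // includes "description"
    }
}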

And now that we are talking JSON, how about storing the snapshots in e.g. CouchDB (http://en.wikipedia.org/wiki/CouchDB) documents?

JSON may take a bit more space, but it is readable.

Jukka
  • This might or might not be useful: http://contourline.wordpress.com/2012/01/18/how-big-is-too-big-for-documents-in-couchdb-some-biased-and-totally-unscientific-test-results/ – Jukka Jun 03 '13 at 20:57
  • JSON is indeed an approach we could use, and I love using human-readable file formats. But they chose to use a binary format for serialization to save space (before I joined the project), and I'm not sure I want to argue for a switch to JSON if it's going to pose problems later down the road when the snapshot becomes too big... I'm considering it, but I'm also interested in other approaches. If someone answers that he uses JSON on his Event Sourcing project with a huge domain model, that may help me argue for JSON :) If we end up choosing JSON, I'll push for Gson, since I love this library :) – Etienne Neveu Jun 05 '13 at 13:07
  • Have a look at BSON and MongoDB as well: http://docs.mongodb.org/manual/core/document/ – Jukka Jun 05 '13 at 14:48

The best list of options I've seen is here: https://github.com/eishay/jvm-serializers/wiki. You'll have to do some quick tests to see what's fast for you. Regarding streaming, I'd have to look through each of the libraries on that list.

Not sure I understand the pretty-printing problem. It doesn't seem necessary to solve efficient serialization and pretty-printing with the same technology, since pretty-printing surely doesn't have to be done super efficiently. If you already have a JavaBean representation, then I'd probably reload the data into beans, and then use Jackson to print the data to JSON.
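
That last step is essentially a one-liner with Jackson. A sketch, reusing the AggregateRoot bean from the question:

import com.fasterxml.jackson.databind.ObjectMapper;

String prettyPrint(AggregateRoot root) throws Exception {
    // Serialize the reloaded beans as indented, human-readable JSON.
    return new ObjectMapper()
            .writerWithDefaultPrettyPrinter()
            .writeValueAsString(root);
}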

Regarding versioning/migrations, have you already solved the problem of how to start a new version of the code that's running the new domain model? If yes, then why not just create a new snapshot after the new version starts?

jtoberon
  • We want pretty printing for debugging purposes, e.g. to see if something was serialized incorrectly in our snapshot. It's ok if the serialized form itself is not pretty-printed, as long as we have a tool that can display the content of a snapshot. Reloading the data into domain model objects, and printing those using Jackson is indeed a solution (I was actually considering XStream, since it supports circular object references). When migrating versions, we plan to stop the server, migrate the snapshot, restart the server using the new snapshot, and replay the events received in between. – Etienne Neveu Jun 10 '13 at 13:15
  • Got it. I'll update my answer based on this info & a few new assumptions. – jtoberon Jun 10 '13 at 14:20