
In Martin Fowler's write-up of the LMAX-disruptor architecture, he says:

"The journaler's job is to store all the events in a durable form, so that they can be replayed should anything go wrong. LMAX does not use a database for this, just the file system. They stream the events onto the disk."

I'm curious what the implementation of the file-system-based event log looks like in practice. The following answer says that it is written to a "raw file", but I'm interested in the actual details one might implement for a production system. Is it literally a raw text file containing a structured log that is continuously appended to? Or is it some sort of binary format? Are there any critical design decisions that go into this component of the system?

JoshAdel

2 Answers


The journaller is just another consumer of the main application ring-buffer. Messages are read off the wire, a header is added (received timestamp, etc.), and the message is then fed into the ring-buffer.
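
To make that concrete, here is a minimal sketch of the ingestion step against the Disruptor's RingBuffer API. The JournalEvent type, its field names, and the fixed 1024-byte payload slot are my own illustrative assumptions, not LMAX's actual message format:

```java
import com.lmax.disruptor.RingBuffer;

// Hypothetical event type: a reusable slot holding the raw wire bytes plus a header.
final class JournalEvent {
    long receivedNanos;                     // header: receive timestamp
    int length;                             // number of valid payload bytes
    final byte[] payload = new byte[1024];  // wire-format bytes, kept untouched
}

final class WireReceiver {
    private final RingBuffer<JournalEvent> ringBuffer;

    WireReceiver(RingBuffer<JournalEvent> ringBuffer) {
        this.ringBuffer = ringBuffer;
    }

    // Read off the wire: stamp a header, copy the bytes into the next slot, publish.
    void onWireBytes(byte[] wireBytes, int length) {
        long seq = ringBuffer.next();              // claim the next slot
        try {
            JournalEvent event = ringBuffer.get(seq);
            event.receivedNanos = System.nanoTime();
            event.length = length;
            System.arraycopy(wireBytes, 0, event.payload, 0, length);
        } finally {
            ringBuffer.publish(seq);               // make it visible to all consumers
        }
    }
}
```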

There are three consumers:

  1. Application handler (invokes business logic)
  2. Replication sender (replicates messages to a secondary)
  3. Journalling handler

The application handler is gated on the journaller completing and on an ack from the secondary, ensuring that a received message is in the secondary's ring-buffer and in the local system's page-cache before the business logic processes it.
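
In Disruptor terms, that gating falls out of the DSL: handlers passed to handleEventsWith() run in parallel, and .then() creates a handler whose sequence barrier waits on all of them. A sketch follows; the handler bodies, ring size, and wait strategy are placeholders, and in practice the replication handler would only advance its sequence once the secondary's ack arrives:

```java
import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

final class Wiring {
    public static void main(String[] args) {
        Disruptor<JournalEvent> disruptor = new Disruptor<>(
                JournalEvent::new,            // every slot is pre-allocated up front
                1 << 16,                      // ring size: must be a power of two
                Thread::new,                  // thread factory for the consumers
                ProducerType.SINGLE,          // one receiver thread publishes
                new BusySpinWaitStrategy());  // lowest latency, burns a core

        EventHandler<JournalEvent> journaller  = (event, seq, endOfBatch) -> { /* append to journal */ };
        EventHandler<JournalEvent> replicator  = (event, seq, endOfBatch) -> { /* send, await ack */ };
        EventHandler<JournalEvent> application = (event, seq, endOfBatch) -> { /* business logic */ };

        // journaller and replicator run in parallel; the application handler's
        // barrier waits on both, so it never sees an event that has not yet
        // been journalled and acknowledged.
        disruptor.handleEventsWith(journaller, replicator).then(application);
        disruptor.start();
    }
}
```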

The journaller is extremely dumb: messages are appended to a fixed-length journal file in wire format. The file is pre-allocated, and various file-system mount options were used to minimise write latency. In the end, we found XFS to be the best file-system option, but ONLY if there are no concurrent readers of the journal file being written. Otherwise there can be nasty locking effects in the XFS code.
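
A minimal sketch of that append path, assuming Java NIO; the class name, the pre-allocation length, and the 1 MiB zeroing buffer are arbitrary choices of mine. The positional FileChannel.write maps to pwrite(2), which the second blog post below compares against seek-then-write:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class Journaller {
    private final FileChannel channel;
    private long writePosition;             // next free byte in the journal

    Journaller(Path journalFile, long journalLength) throws IOException {
        channel = FileChannel.open(journalFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        // Pre-allocate by writing zeros to the full length, so later appends
        // never extend the file (extending updates metadata and costs latency).
        ByteBuffer zeros = ByteBuffer.allocateDirect(1 << 20);
        for (long pos = 0; pos < journalLength; ) {
            zeros.clear().limit((int) Math.min(zeros.capacity(), journalLength - pos));
            pos += channel.write(zeros, pos);
        }
        channel.force(true);
    }

    // Append one message in wire format: a single positional write (pwrite).
    void append(ByteBuffer wireBytes) throws IOException {
        while (wireBytes.hasRemaining()) {
            writePosition += channel.write(wireBytes, writePosition);
        }
    }
}
```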

I wrote all this up in excruciating detail if you're interested in how we got to these conclusions:

https://epickrram.blogspot.co.uk/2015/05/improving-journalling-latency.html

https://epickrram.blogspot.co.uk/2015/07/seek-write-vs-pwrite.html

https://epickrram.blogspot.co.uk/2015/12/journalling-revisited.html

Mark Price

The journal, as suggested, needs to record two pieces of information for each entry: the event itself, exactly as received, and some sort of identifier tracking where in the journal you are, so that replay can start from a chosen record.
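
For example, one possible framing, assuming the sequence number from the ring buffer serves as the replay identifier (the field sizes here are illustrative):

```java
import java.nio.ByteBuffer;

final class JournalRecord {
    // One possible on-disk frame: the event bytes exactly as received,
    // prefixed by the replay identifier and a length.
    //
    //   [ sequence : 8 bytes ][ length : 4 bytes ][ payload : length bytes ]
    //
    static void write(ByteBuffer journal, long sequence, ByteBuffer wireBytes) {
        journal.putLong(sequence);              // where we are in the journal
        journal.putInt(wireBytes.remaining());  // payload size
        journal.put(wireBytes);                 // payload, byte-for-byte as received
    }
}
```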

The storage format is ultimately your decision; however, the following considerations apply:

  • Replays may need to be triggered not just by system crashes but by bugs in your own code. The less manipulation of the input message byte stream, the better: any manipulation introduces a chance of bugs and makes your replay logic very different from simply dropping the bytes back into the input buffer (see the replay sketch after this list). To me this is probably the biggest decision.

  • Replays should be quick and should not involve business logic. A format that lets your storage device write and read sequentially, without the back-and-forth seeking that a database with indexes would require, will perform better. The more layers you have between your ring-buffer input and your storage layer, the slower things will be.

  • Pre-allocated storage on disk (you could even use a raw partition) allows you to write the bytes from beginning to end without updating directory metadata and the free-space tracking areas of the file system. This simplifies the write path and improves performance. As long as the pre-allocation is enough to hold all data between checkpoints, you will be fine. This becomes less of a concern over time as storage devices improve.
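
Putting those considerations together, a replay loop might look like the sketch below: scan the journal sequentially, skip records before the chosen starting sequence, and hand each payload straight back to the input path untouched. The frame layout matches the hypothetical one above, and the inputBuffer callback stands in for whatever feeds your ring buffer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.BiConsumer;

final class Replayer {
    static void replay(Path journalFile, long fromSequence,
                       BiConsumer<Long, ByteBuffer> inputBuffer) throws IOException {
        try (FileChannel channel = FileChannel.open(journalFile, StandardOpenOption.READ)) {
            ByteBuffer header = ByteBuffer.allocate(12); // 8-byte sequence + 4-byte length
            while (channel.read(header) == 12) {
                header.flip();
                long sequence = header.getLong();
                int length = header.getInt();
                if (length <= 0) break;                  // hit the pre-allocated zeros
                ByteBuffer payload = ByteBuffer.allocate(length);
                while (payload.hasRemaining() && channel.read(payload) >= 0) { }
                payload.flip();
                if (sequence >= fromSequence) {
                    inputBuffer.accept(sequence, payload); // bytes re-enter untouched
                }
                header.clear();
            }
        }
    }
}
```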

jasonk