Using DVCS for an RDBMS audit trail

Question

I'm looking to implement an audit trail for a reasonably complicated relational database, whose schema is prone to change. One avenue I'm thinking of is using a DVCS to track changes.

(The benefits I can imagine are: schemaless history, snapshots of the entire system's state, standard tools for analysis, playback and migration, efficient storage, a separate system, and keeping the DB clean. The database is not write-heavy and history is not a core feature; it's more for the sake of having an audit trail. Oh, and I like trying crazy new approaches to problems.)

I'm not an expert with these systems (I only have basic git familiarity), so I'm not sure how difficult it would be to implement. I'm thinking of taking Mercurial's approach, but possibly storing the file contents/manifests/changesets in a key-value data store rather than in actual files.

Data rows would be serialised to JSON, and each "file" could be a row. Alternatively, an entire table could be stored in a "file", with each row residing on the line number equal to its primary key (assuming the tables aren't too big; I'm expecting all of them to have fewer than 4000 or so rows). This might mean that the changesets could be automatically generated without consulting the rest of the table "file".

(But I doubt it, because I think we need a SHA-1 hash of the whole file. The files could perhaps be split up by a predictable number of lines, e.g. 0 ≤ primary key < 1000 in file 1, 1000 ≤ primary key < 2000 in file 2, etc., keeping them smallish.)

Is there anyone familiar with the internals of DVCSs, or data structures in general, who might be able to comment on an approach like this? How could it be made to work, and should it even be done at all?

I guess there are two aspects to a system like this: 1) mapping SQL data to a DVCS system and 2) storing the DVCS data in a key/value data store (not files) for efficiency.

(NB: the JSON serialisation bit is covered by my ORM.)

score 2 · Answer 1 · answered Jun 18 '11 at 13:55

I've looked into this a little on my own, and here are some comments to share.

Although I had thought using Mercurial from Python would make things easier, there's a lot of functionality in DVCSs that isn't necessary here (especially branching and merging). I think it would be easier to simply steal some design decisions and implement a basic system for my needs. So, here's what I came up with.

Blobs

The system makes a JSON representation of the record to be archived and generates a SHA-1 hash of it (a "node ID", if you will). This hash identifies the state of that record at a given point in time and plays the same role as the ID of a git "blob".
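A minimal sketch of the hashing step, assuming records arrive as plain dicts from the ORM. The function name `blob_id` and the sample record are made up for illustration, and unlike git no type/length header is prepended before hashing.

```python
import hashlib
import json


def blob_id(record: dict) -> str:
    """Return the SHA-1 hex digest of a canonical JSON form of the record."""
    # sort_keys and fixed separators make the serialisation deterministic,
    # so two equal records always produce the same ID.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Hypothetical record before and after an edit.
before = {"id": 7, "name": "Ada", "email": "ada@example.org"}
after = {**before, "email": "ada@example.com"}

print(blob_id(before))
print(blob_id(after))
```

Canonical serialisation matters here: if key order were allowed to vary, an unchanged record could appear to have a new state.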

Changesets

Changes are grouped into changesets. A changeset takes note of some metadata (timestamp, committer, etc) and links to any parent changesets and the current "tree".
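Roughly, a changeset could look like the sketch below. The class and field names are my own assumptions (nothing here is Mercurial's or git's actual format); the point is only that the changeset ID is derived from the metadata, the parent IDs and the tree ID together.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Changeset:
    tree: str              # ID of the top-level tree after this change
    parents: tuple = ()    # IDs of the parent changesets (empty for the first)
    author: str = ""
    timestamp: str = ""    # ISO 8601 string, to keep serialisation simple
    message: str = ""

    def node_id(self) -> str:
        """Hash the whole changeset so its ID covers metadata, parents and tree."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Placeholder tree IDs; real ones would come from hashing the trees.
root = Changeset(tree="a" * 40, author="will", timestamp="2011-06-18T13:55:00Z")
child = Changeset(
    tree="b" * 40,
    parents=(root.node_id(),),
    author="will",
    timestamp="2011-06-18T14:10:00Z",
    message="import script run",
)
```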

Trees

Instead of using Mercurial's "Manifest" approach, I've gone for git's "tree" structure. A tree is simply a list of blobs (model instances) or other trees. At the top level, each database table gets its own tree. The next level can then be all the records. If there are lots of records (there often are), they can be split up into subtrees.

Doing this means that if you only change one record, you can leave the untouched trees alone. It also allows each record to have its own blob, which makes things much easier to manage.
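To illustrate why untouched subtrees can be left alone, here is a small sketch. It assumes integer primary keys and a hypothetical bucket size of 1000 records per subtree; `store` stands in for whatever key-value store is used.

```python
import hashlib
import json

BUCKET_SIZE = 1000  # assumed number of primary keys per subtree


def sha1_json(obj) -> str:
    """Hash any JSON-serialisable object deterministically."""
    return hashlib.sha1(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


def build_table_tree(rows: dict, store: dict) -> str:
    """Store blobs and subtrees for one table and return the table tree's ID.

    rows maps integer primary key -> record dict; store maps ID -> object.
    """
    buckets = {}
    for pk, record in rows.items():
        blob = sha1_json(record)
        store[blob] = record
        # Group records into subtrees by primary key range.
        buckets.setdefault(pk // BUCKET_SIZE, {})[str(pk)] = blob

    subtrees = {}
    for bucket, entries in buckets.items():
        tree_id = sha1_json(entries)
        store[tree_id] = entries
        subtrees[str(bucket)] = tree_id

    table_id = sha1_json(subtrees)
    store[table_id] = subtrees
    return table_id


store = {}
rows = {1: {"id": 1, "name": "Ada"}, 2: {"id": 2, "name": "Bob"}, 1500: {"id": 1500, "name": "Cy"}}
t1 = build_table_tree(rows, store)

changed = dict(rows)
changed[1500] = {"id": 1500, "name": "Cyrus"}
t2 = build_table_tree(changed, store)
```

After the edit to record 1500, the table tree and subtree "1" get new IDs, while subtree "0" keeps its old ID and is simply shared between both versions.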

Storage

I like Mercurial's revlog idea, because it allows you to minimise data storage (storing only the changes between revisions, plus occasional snapshots) and at the same time keep retrieval quick (all of a record's changes are in the same data structure). This is done on a per-record basis.
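A heavily simplified, revlog-inspired sketch of the per-record log follows. Mercurial's real revlog stores binary deltas and decides when to write a full snapshot based on delta chain size; here I assume field-level deltas on dicts and a fixed, made-up interval of one snapshot every four revisions.

```python
SNAPSHOT_EVERY = 4  # assumed interval between full snapshots


class RecordLog:
    """History of one record: periodic full snapshots with deltas in between."""

    def __init__(self):
        # Each entry is ("snapshot", state) or ("delta", (changed, removed)).
        self.entries = []

    def append(self, state: dict) -> int:
        """Store a new state and return its revision number."""
        if len(self.entries) % SNAPSHOT_EVERY == 0:
            self.entries.append(("snapshot", dict(state)))
        else:
            prev = self.get(len(self.entries) - 1)
            changed = {k: v for k, v in state.items() if k not in prev or prev[k] != v}
            removed = [k for k in prev if k not in state]
            self.entries.append(("delta", (changed, removed)))
        return len(self.entries) - 1

    def get(self, rev: int) -> dict:
        """Rebuild the state at rev from the nearest earlier snapshot."""
        base = rev
        while self.entries[base][0] != "snapshot":
            base -= 1
        state = dict(self.entries[base][1])
        # Everything after the base snapshot up to rev is a delta.
        for _kind, (changed, removed) in self.entries[base + 1 : rev + 1]:
            state.update(changed)
            for key in removed:
                state.pop(key, None)
        return state


log = RecordLog()
states = [
    {"id": 1, "name": "Ada"},
    {"id": 1, "name": "Ada", "email": "ada@example.org"},
    {"id": 1, "name": "Ada L.", "email": "ada@example.org"},
    {"id": 1, "name": "Ada L."},
    {"id": 1, "name": "Ada Lovelace"},
]
for state in states:
    log.append(state)
```

The snapshot interval bounds how many deltas have to be replayed to rebuild any revision, which is what keeps retrieval quick.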

I think a system like MongoDB would be best for storing the data (It has to be key-value, and I think Redis is too focused on keeping everything in memory, which is not important for an archive). It would store changesets, trees and revlogs. A few extra keys for the current HEAD etc and the system is complete.

Because we're using trees, we probably don't need to explicitly link foreign keys to the exact "blob" they refer to. Just using the primary key should be enough. I hope!

Use case 1. Archiving a change

As soon as a change is made, the current state of the record is serialised to json and a hash is generated for its state. This is done for all other related changes and packaged into a changeset. When complete, the relevant revlogs are updated, new trees and subtrees are generated with the new object ("blob") hashes and the changeset is "committed" with meta information.
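The commit step might be sketched like this. To keep it short I leave out subtrees and revlogs and store every blob whole; a plain dict stands in for the key-value store, and the `commit` signature and the "HEAD" key are assumptions for illustration.

```python
import hashlib
import json


def put(store: dict, obj) -> str:
    """Store an object under the SHA-1 of its canonical JSON and return the ID."""
    key = hashlib.sha1(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()
    store[key] = obj
    return key


def commit(store: dict, changes: dict, author: str, timestamp: str) -> str:
    """Archive a group of changes; changes maps (table, pk) -> record dict."""
    parent = store.get("HEAD")
    # Start from a copy of the parent's top-level tree (table name -> tree ID).
    root = dict(store[store[parent]["tree"]]) if parent else {}
    for (table, pk), record in changes.items():
        table_tree = dict(store[root[table]]) if table in root else {}
        table_tree[str(pk)] = put(store, record)
        root[table] = put(store, table_tree)
    changeset = {
        "tree": put(store, root),
        "parents": [parent] if parent else [],
        "author": author,
        "timestamp": timestamp,
    }
    store["HEAD"] = put(store, changeset)
    return store["HEAD"]


store = {}
c1 = commit(
    store,
    {("person", 1): {"id": 1, "name": "Ada"}, ("book", 10): {"id": 10, "title": "Notes"}},
    "will",
    "2011-06-18T13:55:00Z",
)
c2 = commit(store, {("person", 1): {"id": 1, "name": "Ada Lovelace"}}, "will", "2011-06-18T14:10:00Z")
```

Only the "person" tree is rewritten by the second commit; the "book" tree ID is carried over unchanged from the parent.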

Use case 2. Retrieving an old state

After finding the relevant changeset (MongoDB search?), the tree is then traversed until we find the blob ID we're looking for. We go to the revlog and retrieve the record's state or generate it using the available snapshots and changesets. The user will then have to decide if the foreign keys need to be retrieved too, but doing that will be easy (using the same changeset we started with).
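Retrieval could then look something like the sketch below. Again blobs are stored whole instead of going through a revlog, the store is a plain dict, and `record_at` plus the small `snapshot` helper that builds test history are hypothetical names of mine.

```python
import hashlib
import json


def put(store: dict, obj) -> str:
    """Store an object under the SHA-1 of its canonical JSON and return the ID."""
    key = hashlib.sha1(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()
    store[key] = obj
    return key


def record_at(store: dict, changeset_id: str, table: str, pk):
    """Return the record's state as of the given changeset, or None if absent."""
    root = store[store[changeset_id]["tree"]]
    if table not in root:
        return None
    blob = store[root[table]].get(str(pk))
    return store[blob] if blob else None


store = {}


def snapshot(tables: dict, parents: list) -> str:
    """Helper for this example only: commit the full state of all tables."""
    root = {
        name: put(store, {str(pk): put(store, rec) for pk, rec in rows.items()})
        for name, rows in tables.items()
    }
    return put(store, {"tree": put(store, root), "parents": parents})


c1 = snapshot(
    {
        "book": {1: {"id": 1, "title": "Draft", "publisher_id": 3}},
        "publisher": {3: {"id": 3, "name": "Old Press"}},
    },
    [],
)
c2 = snapshot(
    {
        "book": {1: {"id": 1, "title": "Final", "publisher_id": 3}},
        "publisher": {3: {"id": 3, "name": "New Press"}},
    },
    [c1],
)

old_book = record_at(store, c1, "book", 1)
# Follow the foreign key using the same changeset, so we get the publisher
# as it was at that time rather than as it is now.
old_publisher = record_at(store, c1, "publisher", old_book["publisher_id"])
```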

Summary

None of these operations should be too expensive, and we have a space-efficient description of all changes to a database. The archive is kept separate from the production database, allowing the database to do its thing and allowing its schema to change over time.

score 0 · Answer 2 · answered Jun 17 '11 at 02:03

If the database is not write-heavy (as you say), why not just implement the actual database tables in a way that achieves your goal? For example, add a "version" column. Then never update or delete rows, except for this special column, which you can set to NULL to mean "current," 1 to mean "the oldest known", and go up from there. When you want to update a row, set its version to the next higher one, and insert a new one with no version. Then when you query, just select rows with the empty version.
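As a rough sketch of that scheme using SQLite from Python: the `person` table, its columns and the helper names are invented for the example, and `id` here identifies the logical row rather than acting as a unique key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT, version INTEGER)")


def save(conn, person_id, name):
    """Retire the current row under the next version number, then insert a new current row."""
    with conn:  # one transaction, so readers never see two current rows
        (next_version,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) + 1 FROM person WHERE id = ?",
            (person_id,),
        ).fetchone()
        conn.execute(
            "UPDATE person SET version = ? WHERE id = ? AND version IS NULL",
            (next_version, person_id),
        )
        conn.execute(
            "INSERT INTO person (id, name, version) VALUES (?, ?, NULL)",
            (person_id, name),
        )


def current(conn, person_id):
    """Return the current name (the row whose version is NULL), or None."""
    row = conn.execute(
        "SELECT name FROM person WHERE id = ? AND version IS NULL", (person_id,)
    ).fetchone()
    return row[0] if row else None


save(conn, 1, "Ada")
save(conn, 1, "Ada L.")
save(conn, 1, "Ada Lovelace")

history = conn.execute(
    "SELECT name, version FROM person WHERE id = 1 AND version IS NOT NULL ORDER BY version"
).fetchall()
```

Here `current(conn, 1)` gives the latest name, and `history` lists the older states with version 1 as the oldest.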

John Zwinck

The database schema is very likely to change over time and it would be good to store metadata with the changesets, such as the timestamp and author. Querying the exact state of the system for a given point in time would also not be straightforward (but neither is the proposed approach). I would also like to keep the audit trail separate from the database to reduce unnecessary complexity and be able to rewind/play back the entire system's state, like a DVCS does so well. Plus, if such an approach works, it would be a good plug-in library for other Django users. – Will Hardy Jun 17 '11 at 02:30
When the schema changes, you could just rename the old table(s) and copy their data to the new one(s), only copying the current-version rows. – John Zwinck Jun 17 '11 at 02:36
Oh and when the database is being written to, it's often going to be an import script of sorts, adding a number of rows at once, which is a good fit for the notion of a "changeset". `hg log` would produce a very readable summary of changes. – Will Hardy Jun 17 '11 at 02:37
That would be a good approach to consider if the DVCS avenue doesn't work out, even if I use a separate history table. – Will Hardy Jun 17 '11 at 02:42
By the way, although this is good advice, my question was about the feasibility and technical details of an experimental approach. I'd like to know more about the wilder possibilities before I begin to choose an approach, including of course the standard SQL audit trail techniques. – Will Hardy Jun 17 '11 at 03:27

score 0 · Answer 3 · answered Jun 17 '11 at 06:12

Take a look at CQRS and Greg Young's event sourcing. I also have a blog post about working in meta events that pinpoint schema changes within the river of business events.

http://adventuresinagile.blogspot.com/2009/09/rewind-button-for-your-application.html

If you look through my blog, you'll also find version script schemes, and you can keep those under source control.
