
I am looking for a simple versioning system for a large number of records or files (~50 million, ~100GB unpacked, ~20MB packed). The files are only a few kilobytes each and have unique IDs, so I don't mind whether they are stored in a flat structure (table, directory, ...) or not. On average, each record changes about once a month, but most changes produce diffs of less than a kilobyte, so versions should compress well. However, a naive database with one full entry per version would grow too quickly. I need the following operations:

  • basic CRUD operations: create, read, update, delete
  • quick listing of recent changes
  • quick listing of recent changes of a particular record
  • query for changes in a given period of time
  • query for changes by a given user (each edit is associated with some user ID and optionally has a commit message as a comment)
  • write operations must go through a commit hook that can validate and reject ill-formed records (a hook sketch follows this list)
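
If the versioning layer ends up being a VCS (the option discussed below), that last requirement maps naturally onto a commit hook. A minimal sketch of a git pre-commit hook, assuming a hypothetical `validate-record` script that exits non-zero on bad input (Subversion offers a server-side pre-commit hook with the same role):

```
#!/bin/sh
# Sketch of a git pre-commit hook: reject the commit if any staged record
# fails validation. "validate-record" is a hypothetical checker that
# exits non-zero for ill-formed input.
for f in $(git diff --cached --name-only --diff-filter=ACM); do
    if ! git show ":$f" | validate-record "$f"; then
        echo "rejected: $f is ill-formed" >&2
        exit 1
    fi
done
exit 0
```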

In short, I am looking for wiki-like software for simple records or files.

I thought about possible solutions:

  • Put files in a version control system. This gives me replication and many available access tools, so it is my preferred solution. But the amount of data is too large for distributed systems like git. Is anyone using Subversion for a similar task with success?

  • Implement my own versioning in a database or in a file system. I would probably need to store only compressed records and diffs; it would be more work, but I would learn something. This would be my preferred solution if it were just for fun.

  • Use a versioning file system. This would make setup, replication and access more difficult. I would probably need to implement my own access API on top of the file system.

  • Use a versioning database system. Can you suggest some?

  • Use some other existing data store with versioning (MediaWiki?, Amazon Cloud Drive?, ...)

Obviously there are many paths. Which paths have been used by others with success for similar or larger amounts of data?

Jakob
  • Since Subversion is your go-to option, have you tried it? It should scale to a database that size, and will take (binary) diffs of each revision. The main problem would be that it stores a "pristine" copy of each file in a working copy, effectively doubling the size of the database on the _client_. [svn 1.7](http://subversion.apache.org/docs/release-notes/1.7.html#wc-ng) has improved working copy meta-data storage, which might improve things a little. – Peter Davis Jun 11 '11 at 20:18

1 Answer


If you're not averse to having a raw copy of each file on your client (which I imagine is OK, if you're considering svn) then git is probably quite a good solution to your problem. The underlying repository storage will use binary diffs between files as well as between versions, so you should have close to optimal compression there.
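
To see how well that delta compression actually works on your data, you can force a full repack and inspect the resulting pack size; the window and depth values below are only illustrative (a larger window trades repack time for better deltas):

```
# Repack everything into a single pack with a wider delta search window
# (values are illustrative, not tuned for this data set).
git repack -a -d --window=250 --depth=50

# Report object counts and the on-disk size of the pack.
git count-objects -v -H
```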

With a bare repo and some scripting, you may even be able to get away with not having the current revision checked out: objects are available from the command line and you can create new commits without a checkout.
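
A minimal sketch of that scripting, assuming a bare repository at /srv/records.git, a branch named master and a record stored as records/12345 (all names are illustrative):

```
# Work directly against a bare repo, using a throwaway index instead of
# a working copy.
export GIT_DIR=/srv/records.git
export GIT_INDEX_FILE=$(mktemp -u)   # -u: just pick a name; git creates the file

# Populate the temporary index from the current branch tip.
git read-tree refs/heads/master

# Store the new file content as a blob and stage it under its record path.
blob=$(git hash-object -w /tmp/record-12345)
git update-index --add --cacheinfo 100644 "$blob" records/12345

# Turn the index into a tree, commit it on top of the old tip, advance the branch.
tree=$(git write-tree)
parent=$(git rev-parse refs/heads/master)
commit=$(echo "update record 12345" | git commit-tree "$tree" -p "$parent")
git update-ref refs/heads/master "$commit"

rm -f "$GIT_INDEX_FILE"
```

Reads and queries are also checkout-free: `git show master:records/12345` returns a single record, and `git log` with `--since`, `--until` and `--author` covers the listing and query requirements from the question.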

Andrew Aylett