14

Is there a distributed version control system (git, bazaar, mercurial, darcs etc.) that can handle files larger than available RAM?

I need to be able to commit large binary files (e.g. datasets, source video/images, archives), but I don't need to be able to diff them, just be able to commit and then update when the file changes.

I last looked at this about a year ago, and none of the obvious candidates allowed this, since they're all designed to diff in memory for speed. That left me with a VCS for managing code and something else ("asset management" software or just rsync and scripts) for large files, which is pretty ugly when the directory structures of the two overlap.

joelhardi

7 Answers

13

It's been 3 years since I asked this question, but, as of version 2.0, Mercurial includes the largefiles extension, which accomplishes what I was originally looking for:

The largefiles extension allows for tracking large, incompressible binary files in Mercurial without requiring excessive bandwidth for clones and pulls. Files added as largefiles are not tracked directly by Mercurial; rather, their revisions are identified by a checksum, and Mercurial tracks these checksums. This way, when you clone a repository or pull in changesets, the large files in older revisions of the repository are not needed, and only the ones needed to update to the current version are downloaded. This saves both disk space and bandwidth.
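
For reference, enabling it is a one-line configuration change plus an extra flag when adding files; the file name below is just an example:

    # in the repository's hgrc (or your user hgrc)
    [extensions]
    largefiles =

    hg add --large raw-footage.avi    # tracked by checksum instead of being stored in history directly
    hg commit -m "Add source video"
    hg push                           # the largefile contents are uploaded to the remote's store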

joelhardi

10

No free distributed version control system supports this. If you want this feature, you will have to implement it.

You can write off git: the developers are focused on raw performance for the Linux kernel development use case, and it is improbable they would ever accept the performance trade-off involved in scaling to huge binary files. I do not know about Mercurial, but it seems to have made similar choices to git in coupling its operating model to its storage model for performance.

In principle, Bazaar should be able to support your use case with a plugin that implements tree/branch/repository formats whose on-disk storage and implementation strategy are optimized for your use case. If the internal architecture blocks you and you release useful code, I expect the core developers will help fix it. Also, you could set up a feature-development contract with Canonical.

Probably the most pragmatic approach, irrespective of the specific DVCS, would be to build a hybrid system: implement a huge-file store, and keep references to its blobs in the DVCS of your choice.
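
As a rough sketch of that hybrid idea (the store path, file names, and pointer-file format below are invented for the example), the workflow can be as simple as hashing the blob, copying it into the store, and committing only the pointer:

    BLOBSTORE=/srv/blobstore                # shared huge-file store (example location)
    f=assets/master-tape.mov                # large binary to "commit"
    sum=$(sha1sum "$f" | cut -d' ' -f1)     # content address for this revision of the file
    cp "$f" "$BLOBSTORE/$sum"               # store the blob once under its hash
    echo "$sum" > "$f.ref"                  # tiny pointer file that the DVCS tracks instead
    bzr add "$f.ref" && bzr commit -m "Update $f ($sum)"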

Full disclosure: I am a former employee of Canonical and worked closely with the Bazaar developers.

ddaa
  • Thanks very much for the reply. I did correspond with some Hg and BZR developers last year and what they said mirrors your assessment -- the BZR folks said "Hmm that's interesting, you could code it" and we considered it but the time cost didn't make sense compared to just using SVN or hacking ... – joelhardi Sep 17 '08 at 02:58
  • ... up some hybrid solution where we're committing file hashes or something. The DVCS projects all seem to be heavily driven by the distributed FOSS development use case, unlike SVN and commercial products, which have a wider range of uses in mind. Hg and BZR are great projects, so too bad for me. – joelhardi Sep 17 '08 at 03:06
4

Yes, Plastic SCM. It's distributed and it manages huge files in 4 MB blocks, so it never has to load them entirely into memory. Find a tutorial on DVCS here: http://codicesoftware.blogspot.com/2010/03/distributed-development-for-windows.html

pablo
  • Thanks for the tip, I'm no longer working on this problem but your answer will be useful to people reading this thread. From their website, there appears to be Linux/BSD/OS X support for Plastic SCM since it's C#/Mono. They're using SQL for backend storage, however, so I'm still skeptical of "large file" support/performance ... by which I originally meant things up to, say, DV video sources in the 1-10 G range. Chunking/diffing something like that out of SQLite *may* work, but how well? If anybody has any experience with this, it would be great info to add. – joelhardi Jun 14 '11 at 21:47
  • Hi, actually we just ran another test with 2 GB files... it is all about storing 4 MB blobs in a database, which is... extremely fast... using SQL Server, or Firebird, or even MySQL... Plastic has an option to save files on the filesystem too. – pablo Jun 16 '11 at 16:28
3

BUP might be what you're looking for. It was built as an extension of git functionality for doing backups, but that's effectively the same thing. It breaks files into chunks and uses a rolling hash to make the file content-addressable, which gives efficient storage.
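
For anyone evaluating it, basic usage looks roughly like this (the directory and backup name are just examples):

    bup init                                # create the bup repository (~/.bup by default)
    bup index -ux /data/raw-video           # index the large files
    bup save -n raw-video /data/raw-video   # split into chunks with the rolling hash and store them
    bup ls raw-video/latest                 # browse what was saved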

Catskul

2

I think it would be inefficient to store binary files in any form of version control system.

A better idea would be to store metadata text files in the repository that reference the binary objects.

pobk
  • Thanks for your response. But yes, I did mean what I asked. I do need to version large files -- there is another class of software "enterprise asset management" that is basically VCS/Aperture/Version Cue on a server for media assets. – joelhardi Sep 16 '08 at 10:17
  • I think the point I was trying to make (not enough coffee I'm afraid) was that the majority of VCS systems aren't designed to version binary objects. As you say, they do in-memory diffs and store the delta... There's little point to versioning binaries since they are intrinsic. – pobk Sep 16 '08 at 15:29
1

Does it have to be distributed? Supposedly the one big benefit subversion has over the newer, distributed VCSes is its superior ability to deal with binary files.

  • Thanks for the answer, but yes, it does. I agree that SVN does handle binary files well -- which is part of what mystifies me that the VCSes I previously tested acted as if segfaulting on a 400 MB file is acceptable behavior. – joelhardi Sep 16 '08 at 10:21
0

I came to the conclusion that the best solution in this case would be to use ZFS.

Yes, ZFS is not a DVCS, but (a command sketch follows the list):

  • You can allocate space for the repository by creating a new filesystem
  • You can track changes by creating snapshots
  • You can send snapshots (commits) to another ZFS dataset
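
A rough sketch of that workflow, assuming a pool named tank and a remote host named backup (both invented for the example):

    zfs create tank/assets                        # dedicated filesystem for the asset tree
    zfs snapshot tank/assets@v1                   # "commit" the current state
    zfs send tank/assets@v1 | ssh backup zfs receive tank/assets                     # full "push" to another host
    zfs snapshot tank/assets@v2                   # a later "commit"
    zfs send -i tank/assets@v1 tank/assets@v2 | ssh backup zfs receive tank/assets   # incremental "push"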