Single file version in git-lfs

Question

Is there a way to set up git-lfs to store only 1 version of an LFS-tracked file? New versions of the file should replace the old. In other words, an old commit should reference the latest (only) version of the LFS files.

I want to do this to keep the repository size down and still be able to sync the latest binaries among all clones. I don't have any need to track changes to files that are put into LFS.

For example, if elephant.bin is modified I want the original elephant.bin deleted from .git/lfs/objects (and anywhere else it might be stored) before adding the new elephant.bin.

I am contemplating doing this with symlinks to the binaries, or trying to figure out git-annex, as either should achieve my goal. Yet, if there's a way I can avoid managing symlinks and stick to the popular git-lfs, that's preferred.

The closest related question I found was Multiple file versions in git-lfs

That's sort of the opposite of version control (non-version non-control? noversion decontrol?). Sounds like you're best served by not putting the files in at all. Stick a tarball of the binaries somewhere retrievable if you want that, and retrieve it and expand it into place and use `.gitignore` to keep Git from complaining about the untracked-and-ignored binaries. — torek, Aug 23 '18 at 22:53
@torek , that would also get the job done. It would be another alternative similar to symlinks or git-annex. I want to know if there's a way to do it with git-lfs. — ShawnFeatherly, Aug 23 '18 at 23:12

score 2 · Accepted Answer · answered Aug 24 '18 at 04:57

This cannot reasonably be done with out-of-the-box LFS functionality. In git, every bit of content is integral to the commits containing that content; and this is true even with LFS[1]. The bottom line is, you'd have to rewrite the entire repo's history every time you changed one of those files. This is cumbersome to do, and also if anyone else has a copy of the repo, each history rewrite will ruin their clones.

I've tried to think through what you'd have to do to make something like this work. With a combination of hooks and filters, you could at least get pretty close - but it would be a lot of work, I can't see how to quite make it work right, and frankly there's not much point to it.

The reason there's not much point to it is, LFS already lets you control the size of your local LFS store by pruning objects that are no longer relevant. It's true that if you check out an old commit (without suppressing LFS) you'll re-download any of its files that you've pruned; so if you absolutely must keep the latest version of the file even when checking out historical versions (rather than simply being willing to tolerate such a behavior), or if somehow in the year 2018 you can't find enough storage even for your central LFS store to keep all versions, then you'll need to work out some other solution.

But if so, you'll need to look outside of LFS for that solution.

[1] - In case you want more detail on that claim: LFS uses an SHA256 hash of your file as that file's "filename". To say it would be unlikely for two different versions of a file to hash to the same "filename" is a massive understatement. What LFS stores in the git repo is a "pointer file", encoded like any other file under git control (a BLOB), whose contents include the LFS object's "filename". So changing the contents of the file under LFS control changes the contents of the pointer file.
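As a rough illustration of that point, here is a minimal Python sketch that builds pointer-file text in the layout the Git LFS pointer spec describes (a `version` line, an `oid sha256:` line, and a `size` line). The helper name `lfs_pointer` and the sample contents are made up for this example; this is not how the LFS client itself is implemented.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Return pointer-file text for `content` (hypothetical helper)."""
    # The SHA-256 of the real file content acts as the LFS object's "filename".
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Two made-up versions of elephant.bin produce two different pointer files.
old_pointer = lfs_pointer(b"elephant version 1")
new_pointer = lfs_pointer(b"elephant version 2")
print(old_pointer)
print(new_pointer)
print("pointer changed:", old_pointer != new_pointer)
```

Because the `oid` line is derived from the content, any change to the large file necessarily changes the text that git itself tracks.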

Now, a BLOB in git is named using a SHA-1 hash. Although this has fewer bits than SHA256, it remains unreasonable to believe that any two different BLOBs will hash to the same ID. (Indeed, if that ever does happen in a single repo, it will break git; but nobody who understands the math is worried.) So changing the version of the file in LFS changes the ID of the pointer file in the repo.
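To make the BLOB naming concrete, here is a small sketch that assumes git's default SHA-1 object format, in which a BLOB's ID is the hash of a short header (`blob`, the byte length, a NUL byte) followed by the content. The function name `git_blob_id` and the sample pointer texts are illustrative only.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute a BLOB ID the way `git hash-object` does (SHA-1 format assumed)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Made-up pointer-file contents standing in for two versions of one LFS file.
pointer_v1 = b"oid sha256:aaaa\nsize 10\n"
pointer_v2 = b"oid sha256:bbbb\nsize 12\n"

print(git_blob_id(pointer_v1))
print(git_blob_id(pointer_v2))
print("blob ID changed:", git_blob_id(pointer_v1) != git_blob_id(pointer_v2))
```

The result can be compared against `git hash-object <file>` on a file holding the same bytes.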

From here it's more of the same. The BLOB is listed (with its ID) in a TREE; so that TREE's content has to change, so that TREE's ID has to change. That TREE may be listed as a "subdirectory" under another TREE, and if so that TREE's ID would change, and so on recursively until finally you reach the root TREE for the COMMIT. The COMMIT metadata includes the TREE ID, so even the COMMIT ID must change.
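The chain described here can be sketched end to end. The following is a simplified model, again assuming the SHA-1 object format: each TREE holds a single entry, and the author, timestamps, directory name `assets`, and commit message are placeholders invented for the example.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Hash a git object: '<kind> <length>' + NUL + body (SHA-1 format assumed)."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def single_entry_tree(mode: str, name: str, child_id: str) -> str:
    """ID of a TREE with one entry: '<mode> <name>' + NUL + raw 20-byte child ID."""
    body = f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(child_id)
    return object_id("tree", body)


def commit_id_for(pointer_text: bytes) -> str:
    """Walk BLOB -> TREE -> root TREE -> COMMIT for one pointer file."""
    blob = object_id("blob", pointer_text)
    subdir = single_entry_tree("100644", "elephant.bin", blob)  # file entry
    root = single_entry_tree("40000", "assets", subdir)         # subdirectory entry
    body = (
        f"tree {root}\n"
        "author A U Thor <author@example.com> 0 +0000\n"        # placeholder identity
        "committer A U Thor <author@example.com> 0 +0000\n"
        "\n"
        "Add elephant.bin\n"
    ).encode("utf-8")
    return object_id("commit", body)


# Same metadata, different pointer text: the COMMIT ID still differs.
print(commit_id_for(b"oid sha256:aaaa\nsize 10\n"))
print(commit_id_for(b"oid sha256:bbbb\nsize 12\n"))
```

Everything except the pointer text is held constant, yet the two COMMIT IDs differ, because the change propagates through each hash in turn.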

Once the COMMIT ID has to change, that means you're talking about a different COMMIT altogether.

So it is truly impossible to change the contents of an existing commit, even where LFS is involved. You can create a slightly mutated copy of the commit, but substituting that into the history is a rewrite.

Thanks for the thorough, well-thought-out answer. Now it makes sense why it's impossible with git-lfs. I'm curious about the hooks and filters idea; it sounds like it could automate everything with vanilla git. I'm trying to think through why it'd be a lot of work. It could extract a zip / tarball stored inside of the `.git` folder on top of the clone every time a sync is done. That zip / tarball would overwrite the remotes if its date is newer. — ShawnFeatherly, Aug 24 '18 at 18:38
