3

I have a repository that has grown so big that it has become unusable: it is over 2GB and takes too long to clone. I now want to shrink it, but still be able to go back to some specific old versions. Shrinking will involve rewriting history, and I'm fine with that. People with clones will have to rebase, cherry-pick or copy their files on top of a new branch in a clone of the new repo.

  • I have binary files in this repository, and I need them there (think of them as mandatory resources for the software to run). So I cannot really use filter-branch or BFG to remove the big binary files, since I may need them when reverting to past commits.
  • I do not care about old, already merged branches (for example feature branches), but I do care about some specific commits (for example the heads of past release branches).
  • Since I'll be modifying very old commits, I have no idea how to properly resolve the merge conflicts that can happen with a basic rebase or cherry-pick. So I'm looking for a solution that doesn't produce any conflicts, or produces only conflicts that can be resolved automatically.
  • I want to preserve all current branches, so that people who have ongoing work in a clone can rebase or copy their changes onto them.
  • I want the history between my new commits to match the history of the old repo (as if the commits in between had been squashed). The history of each current branch will start from one of these squashed commits.

I think of it as squashing the unneeded old history of the repository. The process I have come up with so far for my case (I am missing some steps, and I am still unsure it will do what I think) is:

  • Clone a mirror of the existing repo.
  • Create orphan branches from the old commits I want to keep. This will create parentless squashed commits containing all the files they need.
  • Somehow link them to recreate the old repo history => How? Merge, rebase, or reset + commit the orphans?
  • Cherry-pick each current branch's list of commits (using ranges) and apply them on top of the latest squashed commit that replaced the parent of their first divergent commit => How do I automatically find which commit to apply a cherry-picked range to? Will that work without conflicts?
  • Move the tags to the new tree. Remove the previous tree. Run git garbage collection.
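The last step of that list (dropping the old tree and collecting garbage) could look roughly like the sketch below. It assumes you work in a throwaway clone and have already deleted or moved every branch and tag that still points into the old history; the function name `prune_unreachable` is made up for illustration.

```shell
# Sketch: make objects that are no longer referenced actually disappear.
# Assumes all refs into the old history have already been deleted or moved.
prune_unreachable() {
  # Reflog entries keep old commits alive, so expire them first.
  git reflog expire --expire=now --all &&
  # Then repack and drop every unreachable object immediately.
  git gc -q --prune=now
}
```

Run it from inside the repository. Without the reflog step, git gc would keep the old commits around until the reflog entries expire on their own.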

Is this feasible without any conflicts? Will it work in all cases (a git commit tree can be pretty complex)? Is there a better way to squash history safely and automatically?

It seems to me that this kind of maintenance task is bound to come up in any long-running project, so I assume other big projects have already found some solution. Maybe there is an option to git init (or another command) that I am not aware of, to create a new repo from an old one for this use case?

Update: I found the beginning of a solution here: https://wincent.com/wiki/Editing,_amending,_or_squashing_the_root_commit_in_a_Git_repository But I would like to do this at multiple points in my history, in a fully automatic way (i.e. without conflicts)...

Asmodehn
  • Are you sure this will actually shrink the history? If you have large binary files, chances are that they are what is taking up the space, and not the commits themselves. You can dump the blob sizes of your large objects and see what percentage of the 2GB they make up; that will give you an upper bound on the improvement you could achieve. – Andrew C Oct 24 '14 at 20:08
  • Once the commits are squashed, the binary files that were referenced only by those commits won't be used anymore and can be garbage collected... I think. Thanks for the blob size tip, it can be useful to check. – Asmodehn Oct 26 '14 at 01:49
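One possible way to dump the blob sizes mentioned in the comment above is sketched here. It is only an illustration, not something posted in the thread: the function name `largest_blobs` is made up, and it assumes a Git recent enough for `git cat-file --batch-check` to accept a format string.

```shell
# Sketch: print the N largest blobs reachable from any ref as "size sha path".
# Sizes are uncompressed object sizes, so the real on-disk cost can be lower.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    grep '^blob ' |
    cut -d' ' -f2- |
    sort -rn |
    head -n "${1:-10}"
}
```

For example, `largest_blobs 20` run inside the repository lists the twenty biggest blobs, which can then be compared with the total repository size.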

3 Answers

1

You can clone just a part of the repo:

git clone --depth <depth> <repository-url>

This is called a shallow clone.

There was a post on the Atlassian blog a while ago that offers other strategies for dealing with a large repo.

Richard Hulse
  • I find a shallow clone useful only if you want to get a big repository as "read only". Otherwise you'll need some other way to make the actual repo smaller, not only your local clone. – Asmodehn Oct 26 '14 at 01:52
1

OK, so after a few days of trial and error, here is the solution that I find best:

1) From the commit you want to use as the new root, do a checkout --orphan to create an orphan branch, and commit the files of that version.

2) For each commit C that you want to keep: check out commit C, reset to the previous new commit B' (a soft reset, so that the files of C stay in place), and commit to create a new commit C' with B' as its parent. (Thanks forvaidya for the link.)

3) You now need to relink the existing branches to the last commit that you kept. Find that commit in the old history. From there, list all commits that have it (or any of its ancestors) as a direct parent. Then you can use the new git replace --graft to replace their old parent with the new commit.
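The three steps above could be scripted roughly as follows. This is only a sketch under a few assumptions, not the script I actually used: it swaps the checkout/reset/commit dance for git commit-tree, which builds a new commit directly from the tree of an old one (so nothing is merged and no conflict can occur), it only relinks children that have a single parent, and the function names `squash_chain` and `relink_children` are made up.

```shell
# Steps 1 and 2: squash_chain <old-commit>...
# Builds one new commit per listed old commit (oldest first), each reusing the
# old commit's tree, and prints the id of the last new commit.
squash_chain() {
  parent=""
  for old in "$@"; do
    msg="Squashed history up to $(git rev-parse --short "$old")"
    if [ -z "$parent" ]; then
      parent=$(git commit-tree -m "$msg" "$old^{tree}") || return 1
    else
      parent=$(git commit-tree -p "$parent" -m "$msg" "$old^{tree}") || return 1
    fi
  done
  printf '%s\n' "$parent"
}

# Step 3: relink_children <old-commit> <new-commit>
# Grafts every direct child of <old-commit> onto <new-commit>.
# Assumes the children are not merges: --graft replaces ALL parents.
relink_children() {
  old=$(git rev-parse "$1") && new=$(git rev-parse "$2") || return 1
  git rev-list --all --children |
    awk -v c="$old" '$1 == c { for (i = 2; i <= NF; i++) print $i }' |
    while read -r child; do
      git replace --graft "$child" "$new"
    done
}
```

With hypothetical release tags v1.0 and v2.0, usage would be something like `new=$(squash_chain v1.0 v2.0)` followed by `relink_children v2.0 "$new"`. The grafts live under refs/replace/, so the old objects are still in the repository until the replacement is made permanent and the old refs are removed.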

It would be very useful to come up with a foolproof script for this, though... I'll post it here if I ever write one.

Warning: step 3) works only if you are using Git 2.x. Git 1.x clients will not see the change in the commit graph.

Asmodehn
  • Did this actually shrink your repo? – Andrew C Oct 27 '14 at 03:21
  • So yes, it did shrink, though less than I anticipated... My repo was 2.0GB before, with a working tree of 1.1GB. After that operation the repo went down to 1.6GB. Most of it is image files of more than 2 MB, but I'm not sure about the details of what happened in the repository. Some of my users are on Windows and can't move to Git 2.1, so I'll just have to create a new repo from the tip of the existing master branch. – Asmodehn Oct 27 '14 at 04:22
0

A Git shallow clone is one answer, but you cannot push from a shallow clone (a restriction that, as far as I know, Git 1.9 and later have lifted).

As far as squashing is concerned, a squash is only a good idea on unpublished history. This link may be useful: http://www.awanitech.com/git-squash.html

Any squash done after a push needs to go on a different branch, because pushing it is not a fast-forward. Such a squash will have no effect on repository size.

If you are ready to do a force push (a history rewrite), then you can use filter-branch to reduce the size.

If your bad versions are on an entirely different branch, you can create a git bundle of the branches you want and use that as the abridged repository.
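A minimal sketch of the bundle idea, assuming a single branch is worth keeping (the function name `make_abridged_repo` and its arguments are made up for illustration):

```shell
# Sketch: make_abridged_repo <branch> <bundle-file> <target-dir>
# Packs only the history reachable from <branch> into a bundle, then clones
# a fresh repository from that bundle.
make_abridged_repo() {
  git bundle create "$2" "$1" &&
  git clone -q -b "$1" "$2" "$3"
}
```

For example, `make_abridged_repo master ../abridged.bundle ../abridged` run inside the old repository gives a new repository that contains master and nothing from the other branches. Note that this does not shorten the history of the kept branch itself.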

forvaidya
  • My repository is private and I can tell all users to rebase, so that's not an issue. But I want to remove the squashed commits afterwards, so that the unused binary blobs can be garbage collected. – Asmodehn Oct 26 '14 at 01:54
  • Since, as you said, your repository is private: back up the existing git repository. Squash using the method described in the linked page. Then go to another folder and make a clone of your original project, and it will not have the unwanted commits. (Of course, making another clone is akin to a force push.) – forvaidya Oct 26 '14 at 03:13
  • Thanks, that solves the first part of my question, which is compressing the old history. But I still need to somehow link the recent history on top, without conflicts... – Asmodehn Oct 26 '14 at 05:56