There's a self-hosted git repository on a Windows Server (Bonobo-based, if anyone is interested). The repository became bloated with binary blobs, and I'd like to strip out these large blobs along with their whole history.

I looked at bfg, git filter-branch, bfg-ish, and git filter-repo. I think my question is independent of the tool, although git filter-repo seems to be the most recommended.

The big question: should I run --strip-blobs-bigger-than 4M on a clone of the repository (a working copy), or should I go straight ahead and manipulate the hosted bare repo that Bonobo manages? If I run it on a client clone, then how will the changes propagate to Bonobo? These changes will be pretty fundamental; will they even be committable?

I have already backed up everything and did some filter-repo analysis. I added the blobs to .gitignore (although modifications to them still show up as changes).
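On the .gitignore point: ignore rules only apply to untracked files, so a blob that is already tracked keeps showing up as modified. The following is a minimal sketch of that behavior and of untracking with git rm --cached. The temporary directory and the file name big.bin are hypothetical and exist only for this demo.

```shell
#!/bin/sh
set -eu
# Hypothetical throwaway repository and file name, used only for this demo.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

printf 'binary-ish content\n' > big.bin
git add big.bin
git commit -q -m "Add blob"

# Ignoring an already tracked file does not hide later modifications.
printf 'big.bin\n' > .gitignore
printf 'changed\n' >> big.bin
status_while_tracked=$(git status --porcelain -- big.bin)

# Remove the file from the index only; it stays on disk.
git rm -q --cached big.bin
git add .gitignore
git commit -q -m "Stop tracking big.bin"

# Now the ignore rule applies and further edits are no longer reported.
printf 'changed again\n' >> big.bin
status_after_untrack=$(git status --porcelain -- big.bin)

echo "while tracked: [$status_while_tracked]"
echo "after untrack: [$status_after_untrack]"
```

Untracking this way only affects future commits; the blobs stay in the existing history until that history is rewritten.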

Csaba Toth

1 Answer

I ended up operating on the hosted bare repository. It looks like filter-repo is intended to be used on a clean clone of a repository:

git filter-repo --strip-blobs-bigger-than 4M
Aborting: Refusing to destructively overwrite repo history since
this does not look like a fresh clone.
  (expected freshly packed repo)
Please operate on a fresh clone instead.  If you want to proceed
anyway, use --force.

So I retried on a clean clone and the command ran, but then I had no idea what to do next. There were no file changes per se to commit or push; only the "meta data" was modified. Interestingly, the operation also stripped [remote "origin"] and [branch "master"] from .git/config, so I needed to re-establish the remote and the branch.
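For what it's worth, rewritten history in a clone is normally propagated with a force push rather than a commit. The sketch below is an assumption-laden demo, not what was done here: a local bare repository (server.git, a hypothetical path) stands in for the hosted repo, and git commit --amend stands in for the filter-repo rewrite, since any rewrite changes commit IDs in the same way. It shows that a plain push is rejected and a forced one replaces the branch.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# "server.git" is a hypothetical stand-in for the hosted bare repository.
git init -q --bare "$work/server.git"
git clone -q "$work/server.git" "$work/clone" 2>/dev/null
cd "$work/clone"
git config user.name "Demo"
git config user.email "demo@example.com"
git symbolic-ref HEAD refs/heads/main

echo one > file.txt
git add file.txt
git commit -q -m "Original commit"
git push -q origin main
old_tip=$(git rev-parse HEAD)

# Stand-in for the history rewrite (assumption: filter-repo is not run here).
git commit -q --amend -m "Rewritten commit"
new_tip=$(git rev-parse HEAD)

# Mimic the stripped remote, then re-establish it.
git remote remove origin
git remote add origin "$work/server.git"

# A plain push is refused because the new tip does not descend from the old one.
if git push -q origin main 2>/dev/null; then
  plain_push=accepted
else
  plain_push=rejected
fi

# A forced push replaces the branch on the server.
git push -q --force origin main
server_tip=$(git --git-dir="$work/server.git" rev-parse refs/heads/main)
echo "plain push: $plain_push, server now at $server_tip"
```

On a real repository the equivalent would be git push --force --all origin followed by git push --force --tags origin, assuming the server allows non-fast-forward updates. Existing clones still hold the old history and should be re-cloned, and the server keeps the old objects on disk until it garbage-collects them.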

So I decided to just go ahead and modify the hosted bare repo. The tool recognizes that it is not a clean clone:

warning: no corresponding .pack: ./objects/pack/pack-f8fc2556f0b95c1a66219fe3ad3fe41d6319a985.idx
Aborting: Refusing to destructively overwrite repo history since
this does not look like a fresh clone.
  (expected freshly packed repo)
Please operate on a fresh clone instead.  If you want to proceed
anyway, use --force.

With --force, the meta data size decreased from 1.3 GB to 150 MB, similar to the result on the clean clone's meta data.

> git filter-repo --force --strip-blobs-bigger-than 4M
Processed 19965 blob sizes
Parsed 3536 commits
New history written in 1.44 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Enumerating objects: 42458, done.
Counting objects: 100% (42458/42458), done.
Delta compression using up to 8 threads
Compressing objects: 100% (12993/12993), done.
Writing objects: 100% (42458/42458), done.
Selecting bitmap commits: 3257, done.
Building bitmaps: 100% (137/137), done.
Total 42458 (delta 33284), reused 37896 (delta 29067), pack-reused 0
Removing duplicate objects: 100% (256/256), done.
Completely finished after 10.20 seconds.
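The answer does not say how the sizes were measured; one way to check the before and after numbers is git count-objects. The sketch below runs it in a throwaway repository (a hypothetical stand-in with random data playing the role of a large blob) and extracts the packed size in KiB.

```shell
#!/bin/sh
set -eu
# Throwaway repository; in practice, run count-objects inside the real repository.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

# About 100 kB of incompressible data stands in for a large binary blob.
head -c 100000 /dev/urandom > blob.bin
git add blob.bin
git commit -q -m "Add blob"

# Pack everything so the size is reported under size-pack.
git gc -q

# -v prints a detailed breakdown in KiB; add -H for human-readable units.
report=$(git count-objects -v)
size_pack_kib=$(printf '%s\n' "$report" | sed -n 's/^size-pack: //p')
echo "Packed size: ${size_pack_kib} KiB"
```

Running git count-objects -vH before and after the rewrite gives directly comparable size-pack values.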

This happens to be a Windows environment. After that, I started over from a clean clone and had to re-trust the repository in Visual Studio and all that. So far I have been able to push some changes, and I'll report back if anything seems not to work.

It's another story if you are dealing with a repository managed by GitHub or another git service; in that case you won't have direct access to the bare repository they manage. I'm not sure what happens then. I guess you can push the meta data change somehow? Someone should comment.

Csaba Toth
  • What is a "genesis repo"? – matt Aug 28 '23 at 00:19
  • @matt not sure what the proper name is, but that's the hosted repo, the "O.G." repo. That's what the git server governs and everyone clones. That's where everyone commits. I know git is all decentralized, but the git server is a central point in practice. – Csaba Toth Aug 29 '23 at 04:59
  • So you didn't "treat the genesis repo" (meaningless phrase anyway). You operated on a fresh clone, which is what the docs advise: https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#FRESHCLONE – matt Aug 31 '23 at 00:17
  • @matt I didn't operate on a clone. I operated on the hosted original repository. You cannot do that with GitHub or GitLab or BitBucket, but since I'm self hosting I could. How can I describe this better? That's the repo which first existed, originally created. All other clients cloned that. But this is not a clone of anything, it is itself. – Csaba Toth Sep 01 '23 at 06:55
  • @matt So imagine you create an empty repo via GitHub UI. Then you can clone that, but that repository at GitHub was the first one, it is not a clone of anything. In my case since I self host, I have access to that and I went ahead and operated on it directly. – Csaba Toth Sep 01 '23 at 06:58
  • @CsabaToth There is no "genesis repo" in git. The repo on the server is usually a bare repo that doesn't have a working tree. https://stackoverflow.com/questions/5540883/whats-the-practical-difference-between-a-bare-and-non-bare-repository – mx0 Sep 01 '23 at 07:16
  • @mx0 correct, I modified both my question and answer with that, thanks! – Csaba Toth Sep 02 '23 at 14:49