4

I have a massive repository: it's more than a gigabyte, and cloning it takes hours. Most of that size comes from a data directory that isn't needed to work on the project locally, but I don't have the authority to simply remove the directory from the repository.

Is there any way to apply a filter to the repository before it's cloned, so that I only download the files I actually need to work on?

Robin Winslow
  • Is this an option? [git clone --depth](http://stackoverflow.com/questions/6941889/is-git-clone-depth-1-shallow-clone-more-useful-than-it-makes-out) – mheinzerling Feb 20 '13 at 12:18
  • @mnhg unfortunately not. The directory I'd like to exclude is at the top level, and the depth of the useful code is much much deeper than that. – Robin Winslow Feb 20 '13 at 14:25
  • `--depth` is a number of revisions, not a directory level (see http://git-scm.com/docs/git-clone) – mheinzerling Feb 20 '13 at 14:29
  • @mnhg oh okay fair enough. I don't really get how that works then, because each commit depends on the last... – Robin Winslow Feb 20 '13 at 14:36
  • Sorry, can't help you here. Never used it. – mheinzerling Feb 20 '13 at 14:38
  • @mnhg From [git-clone](http://www.kernel.org/pub/software/scm/git/docs/git-clone.html#_options): "A shallow repository has a number of limitations (you cannot clone or fetch from it, nor push from nor into it)" - this probably makes it not very useful in my situation. I need to actually make changes to the central repository. – Robin Winslow Feb 20 '13 at 15:04
  • @RobinWinslow Those shallow clone limitations were lifted in Git 1.9 – Peter Knego Jan 08 '16 at 08:55

2 Answers

1

It is possible to apply a filter while cloning the repository, though not before.
That is done with `git clone --filter=...`, as detailed in "What is the git clone --filter option's syntax?"

For example, a minimal amount of data to clone would be:

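# Minimal clone: fetch commits and trees, but defer file contents (blobs)
git clone --filter=blob:none --no-checkout https://github.com/git/git
cd git
# Enable cone-mode sparse checkout (initially only top-level files are included)
git sparse-checkout init --cone
# Populate the index and working tree according to the sparse-checkout patterns
git read-tree -mu HEAD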
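
To actually leave out a large top-level directory, as the question asks, the clone above could be followed by `git sparse-checkout set`. The sketch below is self-contained: it builds a throwaway local repository standing in for the real remote, so the `data/` and `src/` directory names, the `demo` variable, and the `uploadpack.*` settings on the stand-in server are assumptions for illustration, not part of the original setup.

```shell
set -eu
demo=$(mktemp -d)

# Stand-in "remote" with a bulky data/ directory next to the real code (hypothetical layout)
git init -q "$demo/srv"
mkdir "$demo/srv/data" "$demo/srv/src"
echo "large dataset" > "$demo/srv/data/dump.bin"
echo "int main(void) { return 0; }" > "$demo/srv/src/main.c"
echo "demo" > "$demo/srv/README"
git -C "$demo/srv" add .
git -C "$demo/srv" -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"
# A local server must opt in to partial-clone filters and on-demand object fetches
git -C "$demo/srv" config uploadpack.allowFilter true
git -C "$demo/srv" config uploadpack.allowAnySHA1InWant true

# Partial clone: no blobs yet, nothing checked out
git clone -q --filter=blob:none --no-checkout "file://$demo/srv" "$demo/client"
git -C "$demo/client" sparse-checkout init --cone
# Keep only top-level files plus src/; data/ is never checked out
git -C "$demo/client" sparse-checkout set src
git -C "$demo/client" read-tree -mu HEAD

ls "$demo/client"
```

Against a real remote only the clone and the sparse-checkout commands are needed. Blobs under the excluded directory are not downloaded unless a later command asks for them.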

And since Git 2.37 (Q3 2022), `git remote -v` also shows the list-objects-filter used when fetching from the remote, if available.

git clone --filter=blob:none "file://$(pwd)/srv.bare"
git remote -v
    srv.bare (fetch) [blob:none]
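
Independently of `git remote -v`, a partial clone records its filter in the repository configuration under `remote.<name>.partialclonefilter`, which can be read directly. A minimal sketch against a throwaway local repository (the `demo2` variable and the stand-in server are assumptions for illustration):

```shell
set -eu
demo2=$(mktemp -d)

# Stand-in "remote" that accepts partial-clone filters
git init -q "$demo2/srv"
git -C "$demo2/srv" -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "initial"
git -C "$demo2/srv" config uploadpack.allowFilter true

git clone -q --filter=blob:none --no-checkout "file://$demo2/srv" "$demo2/client"
# Filter recorded for the "origin" remote at clone time
recorded_filter=$(git -C "$demo2/client" config remote.origin.partialclonefilter)
echo "$recorded_filter"
```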

Along the same lines, Git 2.38 (Q3 2022) enables the `git fetch` client to log the partial clone filter it used in the trace2 output.

See commit 1007557 (26 Jul 2022) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 3a4d71f, 05 Aug 2022)

fetch-pack: write effective filter to trace2

Signed-off-by: Jonathan Tan

Administrators of a managed Git environment (like the one at $DAYJOB) might want to quantify the performance change of fetches with and without filters from the client's point of view, and also detect if a server does not support it.

Therefore, log the filter information being sent to the server whenever a fetch (or clone) occurs.
Note that this is not necessarily the same as what's specified on the CLI, because during a fetch, the configured filter is used whenever a filter is not specified on the CLI.

GIT_TRACE2=1 git fetch
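
The filter appears to be recorded as a trace2 data event, which the event and perf targets emit but the plain `GIT_TRACE2=1` target may not print. A hedged sketch that traces a clone with the event target and searches the log for the filter key; the `filter/` key prefix, the `demo3` and `trace_file` names, and the stand-in server are assumptions, and a Git older than 2.38 will log no such key:

```shell
set -eu
demo3=$(mktemp -d)
trace_file="$demo3/trace.json"   # GIT_TRACE2_EVENT expects an absolute path

# Stand-in "remote" that accepts partial-clone filters
git init -q "$demo3/srv"
git -C "$demo3/srv" -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "initial"
git -C "$demo3/srv" config uploadpack.allowFilter true

# Write JSON trace2 events for the clone to the trace file
GIT_TRACE2_EVENT="$trace_file" git clone -q --no-checkout --filter=blob:none \
    "file://$demo3/srv" "$demo3/client"

# Show any filter-related data events, if this Git version logs them
grep '"filter/' "$trace_file" || echo "no filter key logged by this Git version"
```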

Note that before Git 2.40 (Q1 2023), `git http-fetch` (which is rarely used) did not identify itself in the trace2 output.

See commit 7abb43c (12 Dec 2022) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit c099531, 26 Dec 2022)

http-fetch: invoke trace2_cmd_name()

Signed-off-by: Jonathan Tan

ee4512e ("trace2: create new combined trace facility", 2019-02-22, Git v2.22.0-rc0 -- merge listed in batch #2) introduced trace2_cmd_name() and taught both the Git built-ins and some non-built-ins to use it.
However, http-fetch was not one of them (perhaps due to its low usage at the time).

Teach http-fetch to invoke this function.
After this patch, this function will be invoked right after argument parsing, just like in remote-curl.c.

VonC
  • Fastest-clone filter (not by a very big margin, I'm picking lint here) is `--filter=tree:0` – jthill Sep 06 '22 at 02:29
0

No, by design of git that is absolutely not possible. You will have to change the central repository.

As an interim solution you could create a new branch, filter only this branch, and do an empty merge from master to your branch. Now people can clone just your branch and work on it. You will then have to merge to master somewhere. But since you added that empty merge, you can now merge between those two branches whenever you want – as long as you don’t change the data directory on master.

Edit: Sorry, the empty merge would defeat the whole purpose, as clients would then again pull down all the data.

Chronial
  • Okay, what if I checked out `master`, created a new `master-skinny` branch and ran `git filter-branch --index-filter ...` and committed and pushed the filter to my new branch. Now would it be possible for people to checkout `master-skinny` without pulling down all the extra version information for `master`? And is it then at all possible to merge changes back into `master` without pulling down the whole of `master`? Or maybe this could only be managed using forks and pull requests... – Robin Winslow Feb 20 '13 at 14:23
  • I think I more-or-less answered my own question. So presumably this would be possible - fork the repository, pull down my fork, `git filter-branch --index-filter ...`, commit back. Now people can clone my skinnier repository. Is it possible for people to submit pull-requests that don't include the `filter` changes, but include all other changes? – Robin Winslow Feb 20 '13 at 14:27
  • They can’t really submit pull requests in that situation, and you also can’t merge without having the full repo. But you can merge their changes on one machine that has the full repo, and everybody else will be fine with the skinny version. To do this you should use a graft to pretend that the full repo had previously been merged into the skinny repo. – Chronial Feb 20 '13 at 18:56
  • okay so the answer really is just a flat-out no. I would have to get all collaborators to stop using the remote, filter the repository, push my changes back to the remote, and get all collaborators to pull again. I'd appreciate it if you could summarise the points discussed in the comments here in the answer? – Robin Winslow Feb 21 '13 at 14:34