
Serving a fairly large repository (9.5 GB for the bare .git), every time someone needs to clone it the server spends a large amount of memory (and time) in the "Compressing objects" phase.

Is there a way to optimize the repository's files so the data is stored as ready for transmission as possible? I want to avoid redoing that work every time a clone is requested.

I'm using GitLab as the central server, but this also happens when cloning directly over SSH from another machine.

Vargas
  • Maybe you can compress all the files in the repo before uploading, so the next time you clone you just need to decompress them. It's trading computational power for internet speed. – Algo7 Sep 27 '20 at 01:21

1 Answer


Think about how you want to manage a repository of this size. Do you need to keep all of this history? Do you have a plan for large files? Does it make sense to split it into multiple repos?

Run `git gc` via GitLab housekeeping in case there are unreachable objects.
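For example, a manual run on a bare copy looks roughly like this; the path is a placeholder, and on a GitLab server you would normally trigger this through the project's housekeeping UI or API instead:

```
# Placeholder path; prefer GitLab's Housekeeping button/API on the real server.
git -C /srv/git/project.git count-objects -v   # check for loose/unreachable objects
git -C /srv/git/project.git gc --prune=now     # repack and drop unreachable objects
```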

Packing large objects can use a significant amount of memory, multiplied by the number of threads.
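A rough sketch of capping that memory in the bare repository that serves clones; the values are illustrative, not tuned recommendations:

```
# Illustrative values only; set in the repository that serves the clones.
git config pack.threads 4            # fewer packing threads -> lower peak memory
git config pack.windowMemory 256m    # cap the delta-search window memory per thread
```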

Gitaly doesn't set `core.bigFileThreshold`, so that may be difficult to tune on GitLab. Try setting it much lower, perhaps 1M, on another copy of the repo. Not having deltas on large files will increase the space on disk, but reduce memory use.
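Something like the following would show the trade-off; the scratch-copy path and the 1m value are assumptions for experimenting, not production settings:

```
# Experiment on a scratch copy, not the production repository.
git -C ~/scratch-copy.git config core.bigFileThreshold 1m
git -C ~/scratch-copy.git repack -a -d -f      # rewrite packs so the new threshold takes effect
git -C ~/scratch-copy.git count-objects -v -H  # compare pack size and repack memory afterwards
```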

GitLab supports the LFS extension. A project to implement this would be a sizable undertaking: it needs object storage, users configured to use it, and a history rewrite to remove the large files that now live elsewhere.
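If you go that route, the client-side part looks roughly like this; the tracked patterns are examples only, and the migrate step rewrites history, so every commit hash changes:

```
git lfs install                                   # once per machine
git lfs track "*.bin" "*.psd"                     # example patterns; use your actual large file types
git add .gitattributes
git commit -m "Track large binaries with LFS"
# Moving existing history into LFS rewrites every commit hash:
git lfs migrate import --include="*.bin,*.psd"
```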

Or don't change much: configure a generous amount of memory and expect clones to take a couple of minutes.

John Mahowald
  • I can't rewrite history because that would change all the hashes, and we use those. Looks like more memory will be the only solution. I think there should be a way to store the repo as it will be transmitted (maybe deflated), so clones would simply be read from disk and sent. – Vargas Sep 28 '20 at 12:55
  • I don't think you can skip the pack in the network protocols, such as over SSH. Maybe `git bundle` and transfer those archives (see the sketch below), but that's a process without tooling in GitLab. Dedupe is memory intensive, and if you have large files that don't dedupe, this is going to remain a problem. Try tuning `core.bigFileThreshold`. – John Mahowald Sep 28 '20 at 17:20
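For reference, a `git bundle` workflow would look something like the sketch below; the hostnames and paths are placeholders, and nothing in GitLab automates this:

```
# On a machine that already has the repository:
git -C /srv/git/project.git bundle create project.bundle --all
# Transfer project.bundle out of band (scp, rsync, HTTP, ...), then on the client:
git clone project.bundle project
git -C project remote set-url origin git@gitlab.example.com:group/project.git
git -C project fetch origin          # catch up on anything newer than the bundle
```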