
Some git repositories are really huge: DragonFly BSD's .git directory is 324 MB, and FreeBSD's is above 0.5 GB in packed size and above 2 GB unpacked.

Do Gitweb, cgit, or any other web tools do any kind of pre-caching for these huge repositories?

How can one estimate the optimal amount of resources (e.g. memory and CPU constraints) for a web interface to a couple of such huge repositories? What would the response time be for a blame or log operation on a random file?

cnst

2 Answers


Thanks to git's object store model, repository size is not really an issue for gitweb and similar tools (by the way, a 500 MB repository is rather small - the Linux kernel is close to 1 GB now, and Android's frameworks/base is a few gigabytes).

This is because gitweb does not need to pull the whole repository to show you a tree - it only ever looks at a few objects: commit objects to show commits, tree objects to display directories, and blob objects to show you files.
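As a rough illustration, here is the kind of lookup gitweb performs for a page, expressed with plain git plumbing (the `README` path is just an assumed example); each command touches only a handful of objects, no matter how large the repository is:

```
git cat-file -p HEAD           # commit object: author, date, message, tree id
git cat-file -p HEAD^{tree}    # tree object: a single directory listing
git cat-file -p HEAD:README    # blob object: contents of one file (assuming README exists)
```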

The only operation that might slow gitweb down is displaying the history of a single file, but that does not happen very often, and even then git is pretty good at coping with it without much trouble.

As far as gitweb speed is concerned, the best optimization you can make is to run gitweb (which is a Perl script) under mod_perl, so that the Perl interpreter is loaded into memory only once. This alone will make gitweb fly, and the git operations themselves will be barely noticeable.
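A minimal sketch of what that could look like in an Apache configuration - the paths and the `GITWEB_CONFIG` location are assumptions that vary by distribution, so treat this as an example rather than a drop-in config:

```
<Directory /usr/share/gitweb>
    # ModPerl::Registry keeps the Perl interpreter (and gitweb.cgi) resident
    # between requests instead of re-launching it for every hit.
    Options +ExecCGI
    AddHandler perl-script .cgi
    PerlResponseHandler ModPerl::Registry
    PerlOptions +ParseHeaders
    # assumed location of the gitweb configuration file
    SetEnv GITWEB_CONFIG /etc/gitweb.conf
</Directory>
```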

mvp
  • Not just `log`, but `blame` will also be slow. I tried doing a blame on my rather unloaded quad-core server, with DragonFly's .git, and `time git blame sys/sys/sockbuf.h` took 5 seconds the first time (i.e. hitting the disk), and now it always takes 0.36 s, using nearly 100% CPU. That means there can be at most about 10 blame or log requests per second per server (one with enough RAM to hold all the repositories in memory), and that's quite poor scalability for a web service, methinks. Is there really nothing that can be done to pre-cache the stuff (by file name) for the web? – cnst May 21 '13 at 16:27
  • `git log` is not slow. `git log file` maybe. How often do your users actually use blame over gitweb? I'd say not very often. Also, is your repository well packed? Like, did you run `git gc` on it? (A quick way to check is sketched after this comment thread.) – mvp May 21 '13 at 16:44
  • Isn't there a way to make `git log file` faster, by having an index of which commits touch which filenames? I realise this is probably not needed for local development, where a five-second wait is good enough, but wouldn't such an optimisation be quite interesting for the web side of git? – cnst May 21 '13 at 17:02
  • I just don't see where you got that 5-second delay from. I took one of the biggest git repos I know of - the Linux kernel - and tried `git log file` on various files in it (even going back 8 years), and I just could not get a run time higher than 1 second. The file CREDITS was changed 123 times during the last 8 years, and yet `time git log CREDITS` takes 0.8 seconds on my machine (not using an SSD, mind you). All the other files I tried (even those dating back to 2005) were faster than that. – mvp May 22 '13 at 16:42
  • I did get 5 seconds; I just tried it again, a couple of days later, and `time git blame sys/sys/sockbuf.h` again took 5 seconds: `0.400u 0.130s 0:05.50 9.6% 0+0k 3196+55io 7312pf+0w`, but then only 0.36 s on the second attempt, `0.320u 0.040s 0:00.36 100.0% 0+0k 0+10io 0pf+0w`, i.e. it's totally reproducible: 5 seconds on the first attempt, 0.36 s on subsequent ones. I think in your case you might have had the whole repo freshly cached in memory, so no HDD access was required. `git version 1.7.6`, `OpenBSD 5.2 amd64`. Or maybe you have a newer version of git that's faster? – cnst May 22 '13 at 22:52
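If packing is the suspect, as mvp asks above, one way to check and repack might look like the following sketch (the repository and file name are the ones from the comments; adjust to taste):

```
git count-objects -v                # many loose objects or several packs hint at poor packing
git gc --aggressive                 # repack everything into one well-deltified pack
time git blame sys/sys/sockbuf.h    # first run is dominated by disk I/O
time git blame sys/sys/sockbuf.h    # second run measures the warm-cache cost
```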

(Update June 2017, 4 years later)
Actually, those repos are tiny.
The Microsoft Windows code base is a huge repo: 3.5 million files and over 270 GB in size.
And Git manages it well... with the addition of GVFS (Git Virtual File System), announced in February 2017, which solves most of Git's scaling issues (too many pushes, branches, too much history, too many files).

And the commands remain reasonably fast (source: "The largest Git repo on the planet"):

https://msdnshared.blob.core.windows.net/media/2017/05/Performance.png

For context, if we tried this with “vanilla Git”, before we started our work, many of the commands would take 30 minutes up to hours and a few would never complete.

See more with "Beyond GVFS: more details on optimizing Git for large repositories"

This is not yet available in native Git, but the Microsoft team is working to bring patches upstream.

VonC