
As git is increasingly advertised (and enhanced) to better support very large repositories (so-called "monorepos"), with major recent enhancements to the sparse-checkout workflow (git-sparse-checkout command and partial clone / promisors / --filter), I'm surprised that I can't find a way to leverage the sparse-checkout configuration/specification when dealing with commit history.

I see that the topic has been partially brought up in previous questions:

The only answers propose per-command path filters, but converting the .git/info/sparse-checkout specification to path filters will often be non-trivial if not impossible.

The lack of sparse-checkout support seems particularly problematic with git diff, where on a large monorepo the differences between two reasonably-distant versions of the repo might be substantially obscured, or effectively unreachable, due to all other teams'/areas' updates, when a simple path filter on the command-line is not viable. This is primarily a readability/reachability/usability concern, but presumably also has a performance component when you are interested in a selection of the tree and all its rename-sources.

Does anyone know whether using sparse-checkout configuration to limit/scope results in git diff (between commits) and other tools like git log is possible, and/or whether such a possibility is in the works?

Tao
  • A diff filtered in the same way as a sparse checkout would be very useful. I don't know if anyone is working on that right now. A set of simple rules for partial clones, along these same lines, would also be very useful. The place to find out about this is, I think, not here, but rather on the Git mailing list: see https://github.com/git/git/blob/master/.github/CONTRIBUTING.md – torek Apr 20 '20 at 19:24
  • Only recently has a large amount of attention been paid to sparse checkout. It has sat in relative obscurity with little attention for a long time. I agree with torek that questions are best addressed on the list, because I don't know that anyone who knows about it frequents Stack Overflow. Of course, I personally recommend against monorepos altogether, because while they can be made to work, they are generally more painful to use. – bk2204 Apr 21 '20 at 00:00

1 Answer

From the comments:

Can I run "git log" (or git diff, between arbitrary commit ranges) in such a way that only commits affecting my sparse checkout cone are returned?

You can show the log for a folder (git log -- aConeFolder) and then diff only between the commits listed in that log.

A native way to limit git diff to the commits touching the cone folder is not, to my knowledge, in the works.
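
As a rough sketch of that workaround (not a native feature), the cone directories can be passed as pathspecs to the history commands. It assumes cone mode, where git sparse-checkout list prints one directory per line, and directory names without spaces. The throwaway repository and the team-a/team-b folder names are hypothetical. Cone mode also checks out files in the repository root and in parent directories of the cone, which this approximation does not cover.

```shell
set -e
# Throwaway demo repository with two hypothetical team folders.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email demo@example.com
git config user.name Demo

mkdir team-a team-b
echo a1 > team-a/file.txt
echo b1 > team-b/file.txt
git add .
git commit -q -m "initial"
echo b2 > team-b/file.txt
git commit -q -am "team-b only"
echo a2 > team-a/file.txt
git commit -q -am "team-a only"

# Restrict the working tree to the team-a cone.
git sparse-checkout init --cone
git sparse-checkout set team-a

# In cone mode, "list" prints the cone directories; reuse them as pathspecs.
# Unquoted on purpose so that each directory becomes its own argument.
cone_dirs=$(git sparse-checkout list)

# Count only the commits that touch the cone.
log_count=$(git rev-list --count HEAD -- $cone_dirs)

# Diff two commits, limited to paths inside the cone.
diff_files=$(git diff --name-only HEAD~2 HEAD -- $cone_dirs)

echo "commits touching the cone: $log_count"
echo "files changed inside the cone: $diff_files"
```

The same pathspec list works with git log. A non-cone sparse-checkout file with negated patterns would not translate this directly, which is the limitation the question points out.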


Some commands (status/diff) run faster with Git 2.34+ (Q4 2021), which introduces the sparse index: they deal only with the files and index entries of your sparse cone.

This does not change the initial answer: in a sparse cone, you still need to run git log -- aConeFolder to get the relevant commits.
But it considerably improves the execution speed of worktree- and index-related commands (diff, status, ...).

As you can imagine, even if you are working in a small corner of a large repository, the index still has to keep track of the repository’s entire contents, not just the parts that you are working in.

Unfortunately, that overhead adds up: every time Git needs to work with the index, it needs to parse and write out a lot of data that doesn’t affect the parts of your repository outside of your sparse checkout.

That’s changing in this release with the addition of a sparse-enabled index.

Unlike the index of previous versions, this release enables the index to only track the parts of your repository that you care about.
Specifically, it only contains entries for parts of your repository that are either in your sparse checkout, or at the boundary between your sparse checkout and the rest of the repository.

Collapsing to a sparse index -- https://github.blog/wp-content/uploads/2021/11/Fig-9-collapsing-to-sparse-index.png

Triangles represent trees and boxes represent blobs.

  • Left: a representation of a non-sparse index’s contents.
  • Right: a sparse-ified index.

The high-level details here are that the index format now understands that specially marked directories indicate the boundary between the contents of your sparse checkout and the parts of your repository that you don’t have checked out.

So git diff/status/... should now only operate on the sparse data you require (checkout and index within your sparse cone) instead of dealing with a full index.

More details with "Make your monorepo feel small with Git’s sparse index" from Derrick Stolee.


And with Git 2.37 (Q3 2022), "sparse-checkout" learns to work well with the sparse-index feature: meaning it is much faster, and can handle large repositories.

See commit 598b1e7, commit b0b40c0, commit ac8acb4, commit 0243930, commit 2d44338, commit 080ab56, commit 9fadb37, commit dce241b, commit 8846847, commit baa73e2 (23 May 2022) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c276c21, 03 Jun 2022)

sparse-checkout: integrate with sparse index

Signed-off-by: Derrick Stolee

When modifying the sparse-checkout definition, the sparse-checkout builtin calls update_sparsity() to modify the SKIP_WORKTREE bits of all cache entries in the index.

Before, we needed the index to be fully expanded in order to ensure we had the full list of files necessary that match the new patterns.

Insert a call to reset_sparse_directories() that expands sparse directories that are within the new pattern list, but only far enough that every necessary file path now exists as a cache entry.
The remaining logic within update_sparsity() will modify the SKIP_WORKTREE bits appropriately.

This allows us to disable command_requires_full_index within the sparse-checkout builtin.

We can see the improved performance in the p2000 test script:

Test                           HEAD~1            HEAD
------------------------------------------------------------------------
2000.24: git ... (sparse-v3)   2.14(1.55+0.58)   1.57(1.03+0.53) -26.6%
2000.25: git ... (sparse-v4)   2.20(1.62+0.57)   1.58(0.98+0.59) -28.2%

These reductions of 26-28% are small compared to most examples, but the time is dominated by writing a new copy of the base repository to the worktree and then deleting it again.

The fact that the previous index expansion was such a large portion of the time is telling how important it is to complete this sparse index integration.

VonC
  • Hi @VonC, I may be missing something, but I don't think the Sparse Index work tackles this at all :( It is a largely transparent optimization, and its implementation does affect the functioning of a diff of the working tree to the index or the index to commits, but it does not change the expected output when comparing two commits, or the set of commits returned in a git log. – Tao Nov 17 '21 at 13:36
  • @Tao it does not change the output, but it greatly improves its execution speed. – VonC Nov 17 '21 at 13:43
  • OK, thank you, but the question I posed is a functional one: Can I run "git log" (or git diff, between arbitrary commit ranges) in such a way that only commits affecting my sparse checkout cone are returned? – Tao Nov 17 '21 at 14:01
  • @Tao OK, I suspect not: the history still needs to be complete in order to get from one commit (affecting your cone) to another commit (affecting your cone), because the parent commit chain needs to be uninterrupted. As such, for now, a git log would display those intermediate commits, even though they might not involve any file from your cone. – VonC Nov 17 '21 at 14:14
  • @Tao I have edited the answer with a possible workaround. – VonC Nov 17 '21 at 14:24
  • My question is insufficiently precisely phrased. I feel bad about changing it a year later to explicitly disqualify your answer, but the (bulk of the) answer really doesn't apply to the question as intended. – Tao Nov 17 '21 at 16:11