How can I track a subset of files from a remote repository?

Question

I'm trying to solve the following situation: I'd like to include a public project that I don't own in mine, trimming the original file tree by removing redundant or unneeded files and leaving only the bare minimum, while still being able to track modifications to the original files.

I've tried making my own copy of said repository, adding the original as remote, but that only works up until I start deleting files from my own copy, at which point trying to fetch the remote changes fails as I'm missing files.

Is that normal? Did I mess something up in the process, and is there a more elegant way to accomplish this?

Have you considered using sparse-checkout, so that your working tree only shows the subset of files you care about, without deleting the files you don't care about that much? — eftshift0, Oct 02 '22 at 17:59
To expand a little bit: You can't just tell git to not care about some files from a branch anymore. If you delete the files from your branch and then you want to merge/cherry-pick something that involves changes to _those_ files, you will get conflicts.... _tree_ conflicts, actually. — eftshift0, Oct 02 '22 at 18:02
@eftshift0 wouldn't sparse-checkout only affect my working directory? If so, the issue at hand is that I don't care so much about _seeing_ those files; rather, I only need maybe ~5% of the original repo in _size_. The original repo weighs >600MB and most of it is composed of vendor examples and documentation, and I'd prefer whoever needs to clone my repo not to also have to deal with 600MB every time. — Lachmann, Oct 04 '22 at 11:41
Well.... that's the price you pay for it being distributed..... I think you are overthinking it. There are _shallow clones_, _sparse checkouts_.... and if you **reeeeeally** feel like it, you can start an orphan branch that has no (previous) history. — eftshift0, Oct 04 '22 at 11:46
@eftshift0 "you can start an orphan branch that has no (previous) history": been there, done that before, and it turned out to be a huge mess when I tried to reintegrate new changes from the original repo into mine. If those are the only viable options, I'll reconsider going down this route. Thanks for the info, btw. — Lachmann, Oct 04 '22 at 11:50
Anytime. @torek will provide great feedback in his answer/comments, I am sure. — eftshift0, Oct 04 '22 at 12:02

score 1 · Answer 1 · answered Oct 02 '22 at 18:06

The short answer is that you can't do it this way: Git is based on commits, not files, and every commit holds a full snapshot of every file. This means that if you make a new commit in which some file does not exist, the difference between the old commit and the new commit is that the file is deleted. Using a later commit from the other repository requires some kind of merge work, whether that's a cherry-pick from a rebase, a manual cherry-pick, or a git merge operation; all of these perform the merge-as-a-verb action. Any such attempt will treat your deletion of the file as just that: a deletion of the file.

That's not ultimately fatal (because you can resolve the modify/delete conflict whichever way you need to), but it's a bad plan in general.
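The modify/delete conflict described above can be reproduced with a small throwaway experiment. This is only a sketch: the repository names `upstream` and `mine`, the file names, and the demo identity are all made up for illustration, and everything happens in a temporary directory.

```shell
#!/bin/sh
# Sketch: reproduce a modify/delete conflict in throwaway repositories.
# "upstream", "mine" and the file names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# A stand-in for the original project.
git init -q upstream
git -C upstream config user.email demo@example.com
git -C upstream config user.name Demo
echo "v1" > upstream/keep.txt
echo "v1" > upstream/vendor_doc.txt
git -C upstream add .
git -C upstream commit -q -m "initial"
branch=$(git -C upstream symbolic-ref --short HEAD)

# My copy, where I delete a file I do not need.
git clone -q upstream mine
git -C mine config user.email demo@example.com
git -C mine config user.name Demo
git -C mine rm -q vendor_doc.txt
git -C mine commit -q -m "remove unneeded file"

# Later, upstream modifies the very file I deleted.
echo "v2" >> upstream/vendor_doc.txt
git -C upstream commit -q -am "update vendor doc"

# Merging the upstream change now stops with a conflict.
git -C mine fetch -q origin
if git -C mine merge "origin/$branch" > "$work/merge.log" 2>&1; then
  result=clean
else
  result=conflict
fi
echo "merge result: $result"
cat "$work/merge.log"
```

The same kind of conflict would come up with a rebase or a cherry-pick, since each of those performs the same merge work on the deleted file.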

In any case, a repository is not allowed to contain another repository, so if you have your own repository and you'd like to clone and make use of some other repository as a subset, you're faced with one of:

incorporating all of their files directly into your own repository, after which your commits and their commits are unrelated and hence Git can't help much; or
incorporating all your files into their repository, which is likely to be "upside down" from the way you want things to be; or
using submodules, which have their own issues.

In general, submodules—while painful (people call them sob-modules for a reason)—tend to be the favored approach here. A lot of Google software, for instance, uses submodules this way.

I guess that among the choices you suggest, using submodules is the best way to go. How would I go about including only the files I'm interested in through submodules? I already incorporate other, _whole_ repos in my projects as submodules, but I've never included only a _part_ of them. — Lachmann, Oct 04 '22 at 11:44
The short version is "you don't". You *can* use sparse-checkout to *check out* only specific files, but it now defaults to "cone mode" which probably isn't what you want, and the old sparse-checkout style is being slowly deprecated in favor of cone mode. — torek, Oct 04 '22 at 15:40
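As a rough illustration of the sparse-checkout suggestion from the comments, the sketch below clones a throwaway repository with `--sparse` and then limits the working tree to one directory. The names `upstream`, `mine`, `dir_a` and `dir_b` are hypothetical. It also lists the tracked paths afterwards, to show the limitation raised in the question's comments: the hidden files are still part of every commit, so sparse-checkout alone does not make the repository itself smaller.

```shell
#!/bin/sh
# Sketch: sparse-checkout hides files from the working tree only.
# "upstream", "mine", "dir_a" and "dir_b" are hypothetical names.
set -eu
work=$(mktemp -d)
cd "$work"

# A stand-in for the original project with two directories.
git init -q upstream
git -C upstream config user.email demo@example.com
git -C upstream config user.name Demo
mkdir upstream/dir_a upstream/dir_b
echo "docs" > upstream/dir_a/old.txt
echo "code" > upstream/dir_b/code.txt
git -C upstream add .
git -C upstream commit -q -m "initial"

# Clone with a sparse working tree, then ask for dir_b only.
git clone -q --sparse upstream mine
git -C mine sparse-checkout set dir_b

# dir_a is absent from the working tree...
ls "$work/mine"

# ...but its files are still tracked in the commit.
tracked=$(git -C mine ls-tree -r --name-only HEAD)
echo "$tracked"
```

The same `git sparse-checkout set` command can in principle be run inside a submodule's working tree, but that is not shown here.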

Chris Burghart · Accepted Answer · 2022-10-06T15:20:52.147

0

WARNING: the answer below only works in very limited cases, where the removed files are never modified in the upstream repository.

Because your copy now has its own commits (the file removals), its history diverges from the remote's as soon as any later commits exist there, and a plain git pull is likely to stop with a "divergent branches" error unless a pull strategy is configured. However, in your case, a git rebase should do exactly what you want.

In simple terms, a rebase will just reapply your commits onto a selected commit of the original repository (typically origin/main). You will end up with a copy of the current origin/main minus the files you chose to remove. Check the git-rebase documentation for details.

Here's an example:

# Clone a repository and remove some files from my local copy
git clone https://github.com/some_repo
cd some_repo
git rm file_a file_b
git commit -m "remove unneeded files"
git rm file_c
git commit -m "remove file_c"

# At a later time, bring in new commits from the
# remote repository and rebase my commits (removals)
# atop the updated content
git fetch
git rebase origin/main
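To see how this behaves without touching a real project, here is a self-contained sketch of the same workflow using two throwaway local repositories. The names `upstream`, `mine`, `dir_a`, `dir_b` and `new_subdir_a1` are placeholders borrowed from the comments, not part of any real repository. It covers only the favorable case from the warning above: upstream adds a brand-new file but never modifies a removed one.

```shell
#!/bin/sh
# Sketch: rebase local removals onto new upstream commits.
# All repository, directory and file names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# A stand-in for the original project.
git init -q upstream
git -C upstream config user.email demo@example.com
git -C upstream config user.name Demo
mkdir upstream/dir_a upstream/dir_b
echo "docs" > upstream/dir_a/old.txt
echo "code" > upstream/dir_b/code.txt
git -C upstream add .
git -C upstream commit -q -m "initial"
branch=$(git -C upstream symbolic-ref --short HEAD)

# My copy, with dir_a removed.
git clone -q upstream mine
git -C mine config user.email demo@example.com
git -C mine config user.name Demo
git -C mine rm -q -r dir_a
git -C mine commit -q -m "remove dir_a"

# Upstream adds a new file under the directory I removed.
mkdir upstream/dir_a/new_subdir_a1
echo "new" > upstream/dir_a/new_subdir_a1/new.txt
git -C upstream add .
git -C upstream commit -q -m "add new_subdir_a1"

# Bring in the new commit and replay my removal on top of it.
git -C mine fetch -q origin
git -C mine rebase -q "origin/$branch"

# List what ended up in my copy after the rebase.
git -C mine ls-files
```

Note that the new upstream file is present after the rebase even though it sits under a directory that was removed locally, so it would need a further removal commit to stay out of the trimmed copy.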

edited Oct 06 '22 at 15:20

answered Oct 04 '22 at 19:48


I'll try doing it this way then, should be relatively straightforward. Thanks! Just one more thing: would this process also scale to a situation where, say, the owner of the original repo adds some subfolders and files within a folder I've decided to delete? Or would that entail some more work on my side? Say the original repo has dir_a, dir_b, dir_c, file_d, and I leave only dir_b in my repo; if the owner adds new_subdir_a1 within dir_a, does rebase still work as it normally would? – Lachmann Oct 04 '22 at 20:09
Git doesn't store folders as separate entities (i.e., you can't commit an empty folder into a Git repository). A `git rm -r ` individually removes each file contained in that folder from the repository. More importantly, I realized that my answer is very misleading, and only works correctly in limited cases. For example, the rebase will fail with a conflict if a file you have deleted gets changed upstream. My apologies! – Chris Burghart Oct 04 '22 at 21:18
I guess I didn't finish my initial response above. Any completely new files, even under previously 'removed' folders, _will_ show up in your rebased copy. – Chris Burghart Oct 04 '22 at 22:35
Given the significant shortcomings I've found in my answer, I'm inclined to delete it. I plan to leave it up a day or two so you can see these comments @Lachmann. – Chris Burghart Oct 04 '22 at 22:37
I cannot delete this answer since it has been accepted, so I've just updated it with a warning about its limited application. – Chris Burghart Oct 06 '22 at 14:54
