
I'm writing a git filter-branch --tree-filter command that uses git log --follow to check if certain files should be kept or deleted during the filtering.

Basically, I want to keep commits that contain a filename, even if this file was renamed and/or moved.

This is the filter I'm running:

git filter-branch --prune-empty --tree-filter '~/preserve.sh' -- --all

This is the command I'm using inside preserve.sh:

git log --pretty=format:'%H' --name-only --follow --all -- "$f"

The result is that when I filter for a file by its new path, the commit that originally created the file (before it was moved) is stripped out of history, which shouldn't happen. For example:

commit 1: creates foo/hello.txt;

commit 2: moves foo/hello.txt to bar/hello.txt;

Running git filter-branch to keep bar/hello.txt yields a history containing only commit 2.

At first, I thought the problem was happening because I wasn't using --all in git log: when analyzing commit 1, it wouldn't find foo/hello.txt because it was only looking at past history, where bar/hello.txt isn't mentioned anywhere. I then added --all, which looks at all commits (including the "future" ones), but nothing changed.

I checked out the commit where the file is created, ran that log command by hand, and it worked (it listed both foo/hello.txt and bar/hello.txt), so the command itself is fine. I also logged the output of the log command when it is run by filter-branch, and in that case I can see that at commit 1 the file is not found (only bar/hello.txt is listed).

I think this problem happens because internally git is copying each commit to a "new repo" structure so by the time it's analyzing commit 1 the newer commits don't exist yet.

Is there a way to fix this, or another way to approach the problem of re-writing history while preserving renames/moves?

I'm running a modified version of the script found in this answer.

Roberto

2 Answers


or another way to approach the problem of re-writing history while preserving renames/moves?

Since git filter-branch is on its way to being deprecated, consider using the newer newren/git-filter-repo instead.

But even that new tool (based on git fast-export/git fast-import) would not follow renamed files.

See newren/git-filter-repo issue 25 which indirectly illustrates the challenges of filtering a repository (with the old git filter-branch or the new filter-repo command) while taking renamed files into account.

[...] This is consistent with how the rev-list, log, and fast-export git subcommands work. E.g. git log -- src/ledger/bin/app/app.cc won't show any history for other paths that this file was renamed or copied from (or for which parts of it came from).
You used the --follow flag specifically, which is a big hack as even noted in the git log documentation (it mentions that it only works when a single file is specified).
If rev-list/log/fast-export, etc. had a --follow option that followed renames, I could simply expose it from filter-repo, but despite the desire for such an option no one has implemented it in many years.
There's some good challenges there too, e.g. we'd probably want to traverse in topological order and we may need two passes -- one to create the topological ordering, and the second to build up additional paths from renames. (A case where this might be necessary: some branch builds on top of 'master' and has some paths within the specified pathspec that came from a rename of something outside the pathspec at the time 'master' existed. If 'master' was traversed before the other branch, then we'd have already picked the more limited pathspec and miss the extra needed paths.)

But even if --follow implemented following of renames for multiple files or a directory or more, that still wouldn't necessarily be sufficient because perhaps the user needs copy detection (i.e. it wasn't a file renamed from somewhere else, rather it was copied).
But with copy detection it's not as clear if you want the full history of the original; I can imagine that in some cases you would but not others.

And if we start doing either rename or copy detection, then we're moving from well-defined correct behavior to heuristics.
For diffs or logs or even merges that's fine, because the results will be interpreted by a user (even in merge, if the detection is wrong, the user can fixup conflicts and make other edits).
Here, we'd record the results of the heuristics in stone. That's a bit worrying to me...and it also means we'd have to open up a pile of knobs (at the very least a similarity percentage, and whether copies are wanted in addition to renames) for configuration.

All that said, I wanted something like that when I was using it too.
The best compromise I came up with was to have people run 'git filter-repo --analyze' beforehand, look at the renames sub-report, and pick out additional paths by hand based on that to feed to their filter-repo run.
The --analyze option still had a few caveats with the rename detection, but that was mostly fundamental to the problem. Providing it and letting the user decide what to include (though I didn't even bother with copy detection), seemed like the best option I had available.
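The workflow described in that quote could look roughly like the sketch below. It assumes git-filter-repo is installed and that everything runs inside a fresh clone. The two paths come from the question's example, the helper names show_renames and filter_with_paths are made up for illustration, and the report location is the default one that --analyze writes to, as far as I know.

```shell
#!/bin/sh
# Sketch only: assumes git-filter-repo is installed and that the commands
# are run inside a fresh clone of the repository to be filtered.

# Step 1: generate the analysis report and print its renames section.
show_renames() {
    git filter-repo --analyze &&
    cat .git/filter-repo/analysis/renames.txt
}

# Step 2: after reading the report, list by hand every historical name
# to keep, one path per line (these two come from the question's example).
paths_file=$(mktemp)
printf '%s\n' 'foo/hello.txt' 'bar/hello.txt' > "$paths_file"

# Step 3: rewrite history, keeping only the listed paths.
filter_with_paths() {
    git filter-repo --paths-from-file "$1"
}
```

You would call show_renames first, adjust the list, and then run filter_with_paths "$paths_file". The choice of paths stays manual, which matches the compromise described in the issue.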

VonC
  • I'm fine with moving to heuristics. The repo has a huge history, analyzing commits will be too much work and it's okay if the history is not entirely correct since I'm just copying subfolders to a new repo (original is kept untouched). Does `filter-repo` solve my problem automatically? – Roberto Dec 18 '19 at 05:18
  • @Roberto Maybe not automatically, but `git filter-repo --analyze` can provide more information. – VonC Dec 18 '19 at 05:23

Essentially what you want to do here is:

  1. Build a map of all commits in the repository, indexed by hash ID.
  2. For each commit, determine the path names you wish to keep / use when running your filter.
  3. Run git filter-branch—or, at this point, just run your own code, since the map you built in step 1, and the stuff you computed in step 2, are a significant part of what filter-branch does—to copy old commits to new commits.
  4. If you are using your own code, create or update branch names for the last copied commits.

You can git read-tree to copy each commit into an index—you can use the main index, or a temporary one—and then use the Git tools to modify the index so as to arrange in it the names and hash IDs that you wish to keep. Then use git write-tree and git commit-tree to build your new commits, just like filter-branch does.
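A minimal sketch of that "own code" route follows. It is not what filter-branch does internally, only an illustration under stated assumptions: it copies the commits reachable from HEAD, does not preserve author or committer identity, and gives no special treatment to empty commits. The names map_dir, edit_index, new_parent_args and copy_history are hypothetical; edit_index is the hook where you would arrange the names you wish to keep.

```shell
#!/bin/sh
# Sketch only: copy every commit reachable from HEAD, letting a hook edit
# the index of each one. Author/committer identity is NOT carried over.

map_dir=$(mktemp -d)          # old hash -> new hash, one small file each

# Hypothetical hook: edit the temporary index for commit $1 (no-op here).
edit_index() { :; }

# Translate a list of old parent hashes into "-p NEW" arguments.
new_parent_args() {
    for p in "$@"; do
        printf -- '-p %s ' "$(cat "$map_dir/$p")"
    done
}

copy_history() {
    tmp=$(mktemp -d)
    GIT_INDEX_FILE=$tmp/index; export GIT_INDEX_FILE   # leave main index alone
    # Parents are listed before children, so their new hashes already exist.
    git rev-list --reverse --topo-order --parents HEAD |
    while read -r old parents; do
        git read-tree "$old"
        edit_index "$old"
        tree=$(git write-tree)
        # $parents and the substitution are left unquoted on purpose:
        # they must split into separate arguments.
        new=$(git log -1 --format=%B "$old" |
              git commit-tree "$tree" $(new_parent_args $parents))
        echo "$new" > "$map_dir/$old"
        echo "$new" > "$map_dir/last"
    done
    unset GIT_INDEX_FILE
    rm -rf "$tmp"
    cat "$map_dir/last"       # new hash corresponding to HEAD
}
```

Step 4 would then be something like git branch rewritten "$(copy_history)", where the branch name rewritten is only an example.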

An easier case

You may be able to simplify this somewhat, if you don't have too many alternative names for files. For instance, suppose that the history—the chains of commits—in the repository looks like this, with two great History Bottlenecks B1 and B2:

  _______________________          ________________          _________
 /                       \        /                \        /         \--bra
< large cloud of commits  >--B1--< cloud of commits >--B2--<    ...    >--nch
 \_______________________/        \________________/        \_________/--es

where the file names that you want to keep are all the same within any one of the three big bubbles, but at commit B2 there is a mass renaming so the names are different in the middle bubble, and likewise at B1 there's a mass renaming so the names are different in the first bubble.

In this case, there's a clear historical test you can perform in any filter (tree filter, index filter, whatever you like, though index filters are far faster than tree filters) to determine which file names to keep. Remember that filter-branch copies commits one by one, in topological order, so that the newly copied parents are created before any newly copied children must be created. That is, it works on commits from the first group first, then it copies bottleneck commit B1, then it works on commits from the second group, and so on.

The hash ID of the commit being copied is available to your filter (regardless of which filter(s) you use): it's $GIT_COMMIT. So you simply need to test:

  • Is $GIT_COMMIT an ancestor of B1? If so, you're in the first set.
  • Is $GIT_COMMIT an ancestor of B2? If so, you're in the first or second set.

Hence an index filter that consists of "preserve names from set of names" can be written as:

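# B1 and B2 must hold the hash IDs of the two bottleneck commits
# (set and export them before running filter-branch).
if git merge-base --is-ancestor "$GIT_COMMIT" "$B1"; then
    set_of_names=/tmp/list1
elif git merge-base --is-ancestor "$GIT_COMMIT" "$B2"; then
    set_of_names=/tmp/list2
else
    set_of_names=/tmp/list3
fi
...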

where files /tmp/list1, /tmp/list2, and /tmp/list3 contain the names of the files to keep. You now need only write the ... code that implements the "keep fixed set of file names during index filter operation". This is actually already done, mostly anyway, in this answer to extract multiple directories using git-filter-branch (as you found earlier today).
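One possible shape for that remaining step is sketched below. It is not the code from the linked answer. It assumes each list holds exact file paths, one per line, and that no path contains newlines or characters that git ls-files would quote. The function name keep_listed_names is made up for illustration.

```shell
#!/bin/sh
# Sketch of the "..." step: remove from the index every path that is not
# an exact line of the list file passed as $1.

keep_listed_names() {
    git ls-files |
        { grep -vxF -f "$1" || true; } |   # paths NOT in the list
        git update-index --force-remove --stdin
}

# Inside the index filter, once the if/elif/else chain has set the variable:
if [ -n "${set_of_names:-}" ]; then
    keep_listed_names "$set_of_names"
fi
```

The "|| true" is there because grep exits with status 1 when it prints nothing, which is the normal case for a commit that contains only listed files.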

torek
  • unfortunately, move/rename commits are spread over the history – Roberto Dec 19 '19 at 21:37
  • In that case, do steps 1 and 2: build the table of "files to keep, organized by original commit hash ID". You can then write your own tool, or use `git filter-branch` with an index filter consisting of "keep the files listed in the big table under hash ID `$GIT_COMMIT`". – torek Dec 19 '19 at 21:55
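
A rough sketch of such a table, built once in the original repository before any rewriting, is shown below. The assumptions are strong and worth stating: linear history, a single followed file, SHA-1 (40 hex digit) hash IDs, and path names without spaces or newlines. The function names touched_names, fill_table and lookup are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumptions: linear history, SHA-1 (40-hex) hash IDs,
# one followed file, and path names without spaces or newlines.

# Turn "git log --pretty=format:%H --name-only --follow -- FILE" output
# (on stdin) into "hash path" lines: the name the file had in each commit
# that touched it.
touched_names() {
    awk 'length($0) == 40 && /^[0-9a-f]+$/ { hash = $0; next }
         NF { print hash, $0 }'
}

# fill_table TOUCHED ALL_COMMITS: ALL_COMMITS lists every hash oldest first.
# Print "hash path" for every commit, carrying the latest known name forward
# to commits that did not touch the file.
fill_table() {
    awk 'NR == FNR { name[$1] = $2; next }
         ($1 in name) { cur = name[$1] }
         cur != "" { print $1, cur }' "$1" "$2"
}

# lookup TABLE HASH: print the path to keep for one commit.
lookup() {
    awk -v h="$2" '$1 == h { print $2 }' "$1"
}

# Possible use (file locations are examples only):
#   git log --pretty=format:%H --name-only --follow -- bar/hello.txt |
#       touched_names > /tmp/touched
#   git rev-list --reverse HEAD > /tmp/all
#   fill_table /tmp/touched /tmp/all > /tmp/table
#   ...then, in the index filter: lookup /tmp/table "$GIT_COMMIT"
```

With history that branches and merges, carrying a name forward along a single ordered list is no longer correct, so the table would have to be built per branch or by walking parents explicitly.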