Turn two commits with removing and adding files into one commit with file move

Question

I imported an SVN repository with SubGit and have problems with git blame. The oldest revision for every source file is from 2014, even though the project was started in 2008.

This is caused by the switch from Ant to Maven (it's a Java project), which changed the source directory structure from /src/package/ to /src/main/java/package. With svn log I can see that there are two revisions for this purpose:

in the first one, all the source code was removed
in the second one, the source files were added back to follow the Maven directory structure

That's why git blame can only go back as far as the day of the Ant -> Maven migration.

Can I somehow rewrite the Git history to make Git understand that all files were actually moved, rather than removed and re-added?

Tracking a file in Git after it has moved is problematical, but in any case I don't see rebasing helping at all here, unless you rewrite history so that the file were never moved (probably not what you want to do). — Tim Biegeleisen, Nov 27 '16 at 14:33

j6t · Answer 1 · 2016-11-27T21:16:57.843

3

Use git filter-branch to remove the commit that removes all source files:

git filter-branch --commit-filter '
    if [ "$GIT_COMMIT" = insert_SHA1_to_remove_here ]
    then
            skip_commit "$@"
    else
            git commit-tree "$@"
    fi' --tag-name-filter cat -- --all

See the git filter-branch manual page for more details (search for "Darl").
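As a rough sketch of how this plays out, the following builds a throwaway repository with a delete commit followed by a re-add commit, then drops the delete commit with the same kind of --commit-filter. The file names, commit messages and variable names are made up for illustration, and it assumes a Git that still ships filter-branch. FILTER_BRANCH_SQUELCH_WARNING only silences the deprecation notice in newer versions.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # quiet the deprecation notice in newer Git
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"

# Hypothetical history: sources added, all deleted, then re-added in the Maven layout
mkdir -p src/pkg
printf 'class A {}\n' > src/pkg/A.java
git add -A && git commit -q -m "add sources (Ant layout)"
git rm -q -r src && git commit -q -m "delete all sources"
mkdir -p src/main/java/pkg
printf 'class A {}\n' > src/main/java/pkg/A.java
git add -A && git commit -q -m "re-add sources (Maven layout)"

delete_commit=$(git rev-parse HEAD^)     # the commit to drop
before=$(git rev-list --count HEAD)

git filter-branch --commit-filter '
    if [ "$GIT_COMMIT" = '"$delete_commit"' ]; then
        skip_commit "$@"
    else
        git commit-tree "$@"
    fi' --tag-name-filter cat -- --all >/dev/null

after=$(git rev-list --count HEAD)
echo "commits before: $before, after: $after"

# With the delete commit gone, the re-add is compared against the old layout directly
status=$(git diff -M --name-status HEAD^ HEAD)
echo "$status"
```

On a real repository you would substitute the actual ID of the delete commit and work on a fresh clone, since every later commit gets a new ID.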

edited Nov 27 '16 at 21:16

answered Nov 27 '16 at 15:20

j6t


Usually filter-branch should be done only as a last resort. But this method will work, and works even if your Git is so old it lacks `git replace`. – torek Nov 27 '16 at 15:52
I think this could leave any branches created after that merge commit still attached to the old commit tree. They would need to be rebased onto the new commit as well. I'm not sure what happens to tags on these branches. – Mort Nov 27 '16 at 15:57
@Mort: the `--tag-name-filter` (which I just noticed is misspelled above) and `--all` take care of that. – torek Nov 27 '16 at 16:01
@torek I've fixed `--tag-name-filter`. Thanks for pointing out the glitch. – j6t Nov 27 '16 at 21:17
Thank you. Looks like I will use this approach. I also have some git notes I want to preserve after the history rewrite. How can I get an old -> new hash mapping report after `git filter-branch` has done its job? – Kirill Nov 28 '16 at 15:40
There is no direct way. But there is a backup of the old refs in `refs/original`. – j6t Nov 28 '16 at 20:34
@Derp: unfortunately the backup doesn't have a complete mapping. If you know the skipped commits, or if you use a `--commit-filter` like this, you can compute or save the mapping. Filter-branch really should copy or move the notes for you (there's `git notes copy` to do the former, nothing to do the latter), but it doesn't. That's another reason not to run the filter-branch step, but rather just leave a `git replace` in place. – torek Nov 30 '16 at 01:41
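One possible way to save the mapping from inside a `--commit-filter`, as mentioned in the last comment, is sketched below. This is only an illustration on a throwaway repository: MAP_FILE is a hypothetical variable name, the three commits are made up, and skipped commits are deliberately not recorded.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # quiet the deprecation notice in newer Git
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"
MAP_FILE=$(mktemp) && export MAP_FILE    # hypothetical: where "old new" pairs are written

for n in 1 2 3; do
    echo "$n" > file.txt
    git add -A && git commit -q -m "commit $n"
done
skip=$(git rev-parse HEAD^)              # pretend this is the commit to drop

git filter-branch --commit-filter '
    if [ "$GIT_COMMIT" = '"$skip"' ]; then
        skip_commit "$@"
    else
        new=$(git commit-tree "$@")               # create the rewritten commit
        echo "$GIT_COMMIT $new" >> "$MAP_FILE"    # remember old -> new
        echo "$new"                               # the filter must print the new ID
    fi' --tag-name-filter cat -- --all >/dev/null

cat "$MAP_FILE"
```

Each line of the resulting file pairs an old ID with its rewritten ID; lines where the two IDs differ are the candidates for something like `git notes copy <old> <new>`.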

torek · Answer 2 · 2016-11-27T16:04:56.157

You have three options

As far as Git is concerned, there is no such thing as a commit with a file move. A commit is just a snapshot: "This is what's in." That's it: no more, no less. In other VCSes, a new commit B that follows an old commit A is not just a snapshot of "what's in", it's also "what changed", possibly including things like "renamed path/to/file to different/path/to/newname". Git, however, chooses instead to (attempt to) reconstruct what changed, by—later, at the time you are looking at it—comparing the new contents of commit B to the old contents of commit A.
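To make the "snapshot only" point concrete, here is a small sketch on a throwaway repository (paths and messages are invented for the example). The commit object contains no rename record; the rename only appears when two snapshots are compared with rename detection turned on.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"

mkdir -p src/pkg
printf 'class A {}\n' > src/pkg/A.java
git add -A && git commit -q -m "snapshot A: old layout"

mkdir -p src/main/java
git mv src/pkg src/main/java/pkg
git commit -q -m "snapshot B: new layout"

# The commit object holds a tree, parent(s), author, committer and message, nothing more
git cat-file -p HEAD

# The rename is reconstructed at diff time by comparing the two snapshots
with_detection=$(git diff -M --name-status HEAD^ HEAD)
without_detection=$(git diff --no-renames --name-status HEAD^ HEAD)
echo "$with_detection"
echo "$without_detection"
```

The same comparison run with --no-renames reports a plain delete plus add, which is all Git can see when the delete and the add sit in two separate commits.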

In general, Git steps back one commit at a time: compare Y-and-Z, then compare X-and-Y, then compare W-and-X, and so on. That's what git log and git blame do, for instance. Note that I've given the commits single letter names here, and assumed a linear sequence: A--B--C--...--Z. In practice we need longer IDs, and not all sequences are linear (but with any luck the sequences right near this problem are linear).

What this means for you is that you must convince Git not to compare commit H ("commit that, vs G, has files under new name") to commit G ("commit that when compared to F, deletes files under old name") but rather to compare commit H to commit F, skipping over G. In fact, perhaps we want to skip commit H as well, by comparing commit I (the one after H) to commit F (the one before G). That's less critical than skipping over the commit that has the files deleted.

For all our options we need to know (or find) several of Git's commit IDs. The four "particularly interesting" commits are:

The commit where "all files are added again": it's H above, but let's call it addaddaddaddaddaddaddaddaddaddaddaddadda (which is actually a potentially-valid Git hash ID). You will need to find the real ID.
The commit where "all files are deleted". This is the parent of the above, so we can name it using the funny suffix-hat (^) syntax that Git provides, by writing addaddaddaddaddaddaddaddaddaddaddaddadda^. But let's just say we have the raw number as de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e.¹
We may also need to know the commit that comes after addaddaddaddaddaddaddaddaddaddaddaddadda. This is the one we called "I" above: as Git is traversing history in reverse, commit goodgoodgoodgoodgoodgoodgoodgoodgoodgood² leads Git to reach commit addaddaddaddaddaddaddaddaddaddaddaddadda, which leads Git to reach de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e, which of course leads Git to reach ...
The commit before all the deletes. Again, we can use the hat syntax for this—in fact, knowing the "good" commit ID, we can just use goodgoodgoodgoodgoodgoodgoodgoodgoodgood^, then goodgoodgoodgoodgoodgoodgoodgoodgoodgood^^, then goodgoodgoodgoodgoodgoodgoodgoodgoodgood^^^, and so on. But I'll just use de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e^ for this one.
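A sketch of how these four IDs might be resolved in practice, again on a made-up throwaway history. The variable names (good, add, delete, before) are just labels for this example, and the git log search assumes you know at least one old path that was deleted.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"

# Hypothetical history: before -> delete -> add -> good
mkdir -p src/pkg
printf 'class A {}\n' > src/pkg/A.java
git add -A && git commit -q -m "before: Ant layout"
git rm -q -r src && git commit -q -m "delete all sources"
mkdir -p src/main/java/pkg
printf 'class A {}\n' > src/main/java/pkg/A.java
git add -A && git commit -q -m "add sources again (Maven layout)"
printf 'class A { int x; }\n' > src/main/java/pkg/A.java
git commit -q -a -m "good: first commit after the migration"

# Walk backwards with the hat syntax; rev-parse expands each name to a full ID
good=$(git rev-parse HEAD)
add=$(git rev-parse "$good^")
delete=$(git rev-parse "$add^")
before=$(git rev-parse "$delete^")

# Alternatively, search for the commit that deleted a known old path
found=$(git log --diff-filter=D --format=%H -1 -- src/pkg/A.java)

printf 'good   %s\nadd    %s\ndelete %s\nbefore %s\n' "$good" "$add" "$delete" "$before"
```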

Option 1: just tell `git blame` to skip the commit

You have several ways to do this, but for git blame in particular, you have one option that is not directly available in other Git commands:

-S <revs-file>
Use revisions from revs-file instead of calling git-rev-list(1).

The documentation for this option is poor (in my opinion): the -S file argument is not a revision list, but rather a graft list.

What this means is that instead of git blame <path>, you can run:

echo addaddaddaddaddaddaddaddaddaddaddaddadda \
  $(git rev-parse de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e^) > \
  /tmp/graft
git blame -S /tmp/graft file-you-are-concerned-with

(or similar, depending on your OS). See below for additional tricks, since you might want to skip the "add" commit too. Of course the two raw commit IDs here need to be the right ones.

(If you have the raw ID of the commit before the "delete" commit, you can use that instead of invoking git rev-parse. The nice thing about invoking rev-parse is that you can use abbreviated commits and thus get the full ones, plus of course all the usual gitrevisions syntax. The "echo" is to make sure both IDs are on the same line, as the -S file is handled the same way as the old Git grafts hack.)
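The effect of the graft file can be checked on a toy history. This is a sketch under the assumption that -S is read in graft format as described above; the repository, paths and variable names are invented, and --porcelain is used only so that the first output token is always a full commit ID.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"

# Hypothetical history: before -> delete -> add
mkdir -p src/pkg
printf 'class A {}\n' > src/pkg/A.java
git add -A && git commit -q -m "before: Ant layout"
git rm -q -r src && git commit -q -m "delete all sources"
mkdir -p src/main/java/pkg
printf 'class A {}\n' > src/main/java/pkg/A.java
git add -A && git commit -q -m "add sources again (Maven layout)"

add=$(git rev-parse HEAD)
before=$(git rev-parse HEAD^^)

# Without a graft, blame stops at the commit that re-added the file
plain=$(git blame --porcelain src/main/java/pkg/A.java | head -n 1 | cut -d' ' -f1)

# Graft line: "<commit> <pretend-parent>", both full IDs on one line
graft=$(mktemp)
printf '%s %s\n' "$add" "$before" > "$graft"
grafted=$(git blame --porcelain -S "$graft" src/main/java/pkg/A.java | head -n 1 | cut -d' ' -f1)

echo "plain:   $plain"
echo "grafted: $grafted"
```

To skip the "add" commit as well, the first ID on the graft line would be the commit after it instead.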

Option 2: hide the commit more generally

If you want to hide the commit from most Git commands, you can do that more permanently in one repository (in a way that does not propagate elsewhere) using git replace:

git replace --graft \
    addaddaddaddaddaddaddaddaddaddaddaddadda \
    de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e^

What we are doing here is telling Git that whenever it's about to look at commit addaddaddaddaddaddaddaddaddaddaddaddadda it should turn its eyes³ instead over to a new "replacement" commit. The git replace command makes the new replacement commit by mostly copying addaddaddaddaddaddaddaddaddaddaddaddadda, but changing its parent from de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e to de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e^, i.e., the commit that came just before the "delete things" commit.
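A small sketch showing what the replacement changes and how to look past it, using an invented throwaway history. The variable names are illustrative; git replace --graft and the global --no-replace-objects option are the only real interfaces relied on.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"

# Hypothetical history: before -> delete -> add
mkdir -p src/pkg
printf 'class A {}\n' > src/pkg/A.java
git add -A && git commit -q -m "before: Ant layout"
git rm -q -r src && git commit -q -m "delete all sources"
mkdir -p src/main/java/pkg
printf 'class A {}\n' > src/main/java/pkg/A.java
git add -A && git commit -q -m "add sources again (Maven layout)"

add=$(git rev-parse HEAD)
delete=$(git rev-parse HEAD^)
before=$(git rev-parse HEAD^^)

# Record a replacement for $add whose only parent is $before
git replace --graft "$add" "$before"

seen_parent=$(git rev-parse "$add^")                        # what Git now shows
real_parent=$(git --no-replace-objects rev-parse "$add^")   # the untouched original
echo "parent with replacement:    $seen_parent"
echo "parent without replacement: $real_parent"

# The replacement lives under refs/replace/ and can be removed again with:
#   git replace -d "$add"
```

Because the original objects are untouched, this is easy to undo, which is part of why it is a gentler choice than rewriting history.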

Option 3: really delete the commit(s)

It is possible to discard one or even both intermediate commits. Let's say, for instance, we've decided to discard both addaddaddaddaddaddaddaddaddaddaddaddadda and its previous de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e. The drawback is that this effectively "re-numbers" every commit after that point: every commit starting from goodgoodgoodgoodgoodgoodgoodgoodgoodgood forward. The new, rewritten repository is no longer compatible with the old repository (and if you did your SVN-to-Git conversion with "notes" attached to each commit to remember the corresponding SVN revision, this process wrecks the notes).

To discard the two commits, start with the same git replace technique as before. This time, however, we want to replace goodgoodgoodgoodgoodgoodgoodgoodgoodgood itself, with a copy that is just like goodgoodgoodgoodgoodgoodgoodgoodgoodgood, except that its parent is the parent of de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e. Hence:

git replace --graft goodgoodgoodgoodgoodgoodgoodgoodgoodgood \
    de1e7ede1e7ede1e7ede1e7ede1e7ede1e7ede1e^

Using our simple single-letter drawing again, what we've done is this:

             -------I'  <-- replacement for I
            /
A--...--E--F--G--H--I--J--...--Z   <-- HEAD

The graft makes Git jump from I to I' by "moving its eyes", so that it never sees H (the re-adds) nor G (the deletes) and jumps directly back to F.

Now that we have the graft in place, we can run git filter-branch --tag-name-filter cat -- --all (the -- separates the rev-list option --all from filter-branch's own options). This obeys the graft while copying every commit it sees to new commits.⁴ The copies "before" the replacement I' are bit-for-bit identical to their originals, so they retain their original hash IDs. The copy of I' remains I', but the copies after I' are different, so they get new IDs.

Once the filtering is done, the filter-branch command replaces the old branch and tag names with new branch and tag names pointing to the new copies. (The new tag names are the same as the old tag names, because our tag name filter was cat.)
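Put together, the whole of option 3 might look like the following on a toy history. This is a sketch, not a recipe: the repository and names are invented, it assumes filter-branch is available, and the final check uses --no-replace-objects to confirm the shortened history no longer depends on the replacement.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # quiet the deprecation notice in newer Git
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"

# Hypothetical history: before(F) -> delete(G) -> add(H) -> good(I) -> later(J)
mkdir -p src/pkg
printf 'class A {}\n' > src/pkg/A.java
git add -A && git commit -q -m "before: Ant layout"
git rm -q -r src && git commit -q -m "delete all sources"
mkdir -p src/main/java/pkg
printf 'class A {}\n' > src/main/java/pkg/A.java
git add -A && git commit -q -m "add sources again (Maven layout)"
printf 'class A { int x; }\n' > src/main/java/pkg/A.java
git commit -q -a -m "good: first commit after the migration"
printf 'class A { int x; int y; }\n' > src/main/java/pkg/A.java
git commit -q -a -m "later work"

good=$(git rev-parse HEAD^)
before=$(git rev-parse HEAD~4)

# Step 1: graft "good" directly onto "before", hiding the delete and add commits
git replace --graft "$good" "$before"

# Step 2: copy every commit while the graft is in effect, making it permanent
git filter-branch --tag-name-filter cat -- --all >/dev/null

# Step 3: the replacement ref is no longer needed by the rewritten history
git replace -d "$good" >/dev/null

count=$(git --no-replace-objects rev-list --count HEAD)
base=$(git --no-replace-objects rev-parse HEAD~2)
echo "commits now reachable from HEAD: $count"
```

The old refs remain under refs/original until you delete them, so the original history is still recoverable in that clone.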

¹It's the Cyberman commit! You will be upgraded, or deleted!

²This is not a valid commit ID but there is a limit to what we can spell with [0-9a-f]. :-)

³Does Git even have eyes, or am I anthropomorphizing computers again?⁵

⁴While the identifying of commits is always done "backwards", from newest commits back to oldest, the copying that git filter-branch does is (necessarily) done "forwards". The way filter-branch works is to copy every commit, with the new copy made after applying any filters. This is why it is so slow. In our case we're doing the copy simply for its side effect of making replacements become permanent.

⁵"Don't anthropomorphize computers, they hate that." —author unknown

score 1 · Answer 3 · answered Nov 27 '16 at 15:56

Do you have lots of branches/tags created after the pair of commits in question? If you do, @j6t's filter-branch solution is probably the way to go.

Otherwise you could just git reset --hard to the second commit, the one that added all of the files back. Then squash it with the preceding delete commit using git rebase -i, or git reset --soft HEAD~2 followed by a git commit. Once the two commits are squashed, use git rebase --onto to replay all of the subsequent commits on the branch on top of the new squashed commit.
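One way the squash-then-rebase route could be carried out is sketched here on a throwaway single-branch history. The branch name squashed and the commit messages are invented, and git reset --soft HEAD~2 is used as one way to collapse the delete and re-add commits into a single commit.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.invalid"

# Hypothetical history: before -> delete -> add -> later
mkdir -p src/pkg
printf 'class A {}\n' > src/pkg/A.java
git add -A && git commit -q -m "before: Ant layout"
git rm -q -r src && git commit -q -m "delete all sources"
mkdir -p src/main/java/pkg
printf 'class A {}\n' > src/main/java/pkg/A.java
git add -A && git commit -q -m "add sources again (Maven layout)"
printf 'class A { int x; }\n' > src/main/java/pkg/A.java
git commit -q -a -m "later work"

main=$(git symbolic-ref --short HEAD)    # whatever the default branch is called
add=$(git rev-parse HEAD^)

# Collapse delete + add into one commit on a temporary branch
git checkout -q -b squashed "$add"
git reset -q --soft HEAD~2               # HEAD moves back; index keeps the re-added tree
git commit -q -m "Move sources to the Maven layout"

# Replay everything after the old re-add commit onto the squashed commit
git rebase -q --onto squashed "$add" "$main"

count=$(git rev-list --count HEAD)
status=$(git diff -M --name-status HEAD~2 HEAD~1)
echo "commits: $count"
echo "$status"
```

With several branches or tags descending from the old commits, each of them would need the same rebase, which is where the filter-branch approach becomes less work.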

Mort


Yes, I have branches. And in this case rebase is not an option at all? I am OK with rewriting history in child branches – Kirill Nov 27 '16 at 16:38
You could probably do that too. If you have tags, also see [here](http://stackoverflow.com/questions/3150685/can-tags-be-automatically-moved-after-a-git-filter-branch-and-rebase) – Mort Nov 27 '16 at 23:55
