39

I am trying to format an entire repo using a code formatter tool. In doing so, I want to keep information about who committed which line, so that commands like git blame still show the correct information. By this, I mean it should show the author that previously edited each line (before it was formatted).

There is the git filter-branch command which allows you to run a command against each revision of the repo starting from the beginning of time.

git filter-branch --tree-filter '\
  npx prettier --write "src/main/web/app/**/*.{js,jsx}" || \
  echo "Error: no JS files found or invalid syntax"' \
  -- --all

It will take forever to run this and really I don't care about the past. I just want to format the master branch going forward without changing ownership of each line. How can I do this? I tried playing with the rev-list at the end and other filter types but it still doesn't work. There must be a way to format the codebase while preserving the author information for each line.

aherriot
  • So the question is "How to edit git history without modifying git history?", right? – phd Nov 27 '18 at 15:33
  • 2
    No, you misunderstand. The git filter-branch command allows me to edit lines, without changing the author of the revision, so `git blame` still works. I simply want to do this for the HEAD and not past revisions. – aherriot Nov 27 '18 at 15:36
  • Then don't use `git filter-branch` — just run the formatter, add and commit. Or if you want to modify the last commit in the branch — `git add` and `git commit --amend`. – phd Nov 27 '18 at 15:39
  • 2
    The problem is that this will change the author of every line to be me. When I say that I want git blame to still work, I mean it should list the author of the previous revision of the line. – aherriot Nov 27 '18 at 15:40
  • 1
    @aherriot - You're characterizing what `filter-branch` does as "editing the line without changing the author"; that's conceptually incorrect. "Lines" don't have authors. Commits have authors, and `git blame` figures out what commit most recently changed each line and reports information about that commit (including author). – Mark Adelsberger Nov 27 '18 at 15:50
  • Some tools allow you to reformat incrementally. For example [darker](https://github.com/akaihola/darker) reformats Python code using Black, but only applies reformatting to lines modified by the developer. – akaihola Aug 08 '20 at 16:29

6 Answers

48

You can make git blame ignore certain commits, such as those that only do mass reformatting:

Create a file .git-blame-ignore-revs like:

 # Format commit 1 SHA:
 1234af5.....
 # Format commit 2 SHA:
 2e4ac56.....

Then do

git config blame.ignoreRevsFile .git-blame-ignore-revs

This way you don't have to pass the --ignore-revs-file option to git blame every time.
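As a minimal sketch of that workflow, the script below builds a throwaway repository, makes a stand-in "formatting" commit, records its hash, and points blame at the ignore file. The file name app.js, the author names, and the printf calls standing in for a real formatter are all hypothetical, and git 2.23 or newer is assumed. Lines the ignored commit changed are blamed on an earlier commit where git can match them, so results on heavily rewrapped code may vary.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the sketch never touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "dev@example.com"
git config user.name "Alice"

# Original, unformatted code written by "Alice" (hypothetical file).
printf 'function  add(a,b){\n    return a+b\n}\n' > app.js
git add app.js
git commit -q -m "Add add()"

# Stand-in for a real formatter run; it rewrites the first two lines.
git config user.name "Formatter Bot"
printf 'function add(a, b) {\n  return a + b;\n}\n' > app.js
git commit -q -am "Reformat code"

# Record the full hash of the formatting commit and configure blame to use it.
git rev-parse HEAD >> .git-blame-ignore-revs
git config blame.ignoreRevsFile .git-blame-ignore-revs

# List the distinct authors blame now reports for the file.
blame_authors=$(git blame --line-porcelain app.js | grep '^author ' | sort -u)
echo "$blame_authors"
```

Committing .git-blame-ignore-revs to the repository lets every contributor share the list, but each clone still has to run the git config command itself.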

Upvote https://github.com/github/feedback/discussions/5033 to get that feature into GitHub's web blame viewer.

kxr
  • 1
    This landed in git, in public beta now. In 2022, this should be the new accepted answer. – pxwise Apr 08 '22 at 18:35
  • Case study/example: [Django pull request that autoformatted their whole codebase](https://github.com/django/django/pull/15387/) and the associated [`.git-blame-ignore-revs`](https://github.com/django/django/commit/b9fee0f849a88091fb0615dd433b9e54f05b32c5). – ggorlen Jun 01 '23 at 19:40
11

What you are trying to do is impossible. You cannot, at some point in time, change a line of code, and yet have git report that the most recent change to that line of code is something that happened before that point in time.

I suppose a source control tool could support the idea of an "unimportant change", where you mark a commit as cosmetic and then history analysis would skip over that commit. I'm not sure how the tool would verify that the change really was cosmetic, and without some form of tool enforcement the feature would assuredly be misused resulting in bug introductions potentially being hidden in "unimportant" commits. But really the reasons I think it's a bad idea are academic here - the bottom line is, git doesn't have such a feature. (Nor can I think of any source control tool that does.)

You can change the formatting going forward. You can preserve the visibility of past changes. You can avoid editing history. But you cannot do all three at the same time, so you're going to have to decide which one to sacrifice.

There are actually a couple down-sides to the history rewrite, by the way. You mentioned processing time, so let's look at that first:

As you've noted, the straightforward way to do this with filter-branch would be very time consuming. There are things you can do to speed it up (like giving it a ramdisk for its working tree), but it's a tree-filter and it involves processing of each version of each file.

If you did some pre-processing, you could be somewhat more efficient. For example, you might be able to preprocess every BLOB in the database and create a mapping (where a TREE contains BLOB X, replace it with BLOB Y), and then use an index-filter to perform the substitutions. This would avoid all the checkout and add operations, and it would avoid repeatedly re-formatting the same code files. So that saves a lot of I/O. But it's a non-trivial thing to set up, and still might be time consuming.

(It's possible to write a more specialized tool based on this same principle, but AFAIK nobody has written one. There is precedent that more specialized tools can be faster than filter-branch...)

Even if you come to a solution that will run fast enough, bear in mind that the history rewrite will disturb all of your refs. Like any history rewrite, it will be necessary for all users of the repo to update their clones - and for something this sweeping, the way I recommend to do that is, throw the clones out before you start the rewrite and re-clone afterward.

That also means that anything that depends on commit IDs will be broken. (That could include build infrastructure, release documentation, etc., depending on your project's practices.)

So, a history rewrite is a pretty drastic solution. And on the other hand, it also seems drastic to suppose that formatting the code is impossible simply because it wasn't done from day 1. So my advice:

Do the reformatting in a new commit. If you need to use git blame, and it points you to the commit where reformatting occurred, then follow up by running git blame again on the reformat commit's parent.
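The two-step blame described above might look like the following sketch. It uses a throwaway repository; the file a.c, the author names, and the printf stand-in for a formatter are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "dev@example.com"
git config user.name "Alice"

printf 'int  x=1;\n' > a.c
git add a.c
git commit -q -m "Add x"

# Stand-in for the reformatting commit.
git config user.name "Formatter Bot"
printf 'int x = 1;\n' > a.c
git commit -q -am "Reformat"

# First pass: blame points at the reformat commit (first field of the porcelain header).
fmt_commit=$(git blame --porcelain a.c | head -n 1 | cut -d' ' -f1)

# Second pass: blame the file as it existed in that commit's parent.
parent_authors=$(git blame --line-porcelain "$fmt_commit^" -- a.c | grep '^author ')
echo "$parent_authors"
```

The `<commit>^` syntax names the first parent, so the second blame sees the file exactly as it was before the reformat.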

Yeah, it sucks. For a while. But a given piece of history tends to become less important as it ages, so from there you just let the problem gradually diminish into the past.

Mark Adelsberger
  • 2
    "I'm not sure how the tool would verify that the change really was cosmetic" -- For code, maybe if the tool had the ability to compare the abstract syntax tree of the revisions rather than the text, it could tell that they are logically equivalent and therefore cosmetic. – Eric Smith Mar 21 '20 at 18:24
  • 1
    @EricSmith There have been tools that attempted to provide so-called "structural diffs". There are a lot of costs and limitations to that approach. If you have language-specific diff tools you want to use, and they aren't too resource-consuming to be used when diffing an entire tree's worth of files, you can use .gitattributes to configure them. – Mark Adelsberger Mar 21 '20 at 19:28
10

git blame -w -M is supposed to ignore whitespace and moved code changes, so you just need to reformat your code and remember to use those options when looking for who to blame!
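A small sketch of what -w does, in a throwaway repository with hypothetical file and author names. The stand-in reformat here changes only indentation. Note that -w covers whitespace-only changes, so a formatter that also rewraps lines or adds punctuation will still show up in blame for those lines.

```shell
#!/bin/sh
set -eu

# Throwaway repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "dev@example.com"
git config user.name "Alice"

printf '  return a+b;\n' > f.js
git add f.js
git commit -q -m "Add return"

# Whitespace-only "reformat": the indentation changes from 2 to 4 spaces.
git config user.name "Formatter Bot"
printf '    return a+b;\n' > f.js
git commit -q -am "Reindent"

# Plain blame reports the reindent commit; -w looks past it.
plain_author=$(git blame --line-porcelain f.js | grep '^author ')
ws_author=$(git blame -w -M --line-porcelain f.js | grep '^author ')
echo "without -w: $plain_author"
echo "with -w:    $ws_author"
```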

https://coderwall.com/p/x8xbnq/git-don-t-blame-people-for-changing-whitespaces-or-moving-code

allgood
4

Mercurial has an (experimental) option for this, "--skip":

--skip <REV[+]>
    revision to not display (EXPERIMENTAL)

I think there is no equivalent yet in default git, but there is a hyper-blame command developed externally.

Similar options (--ignore-rev <rev> and --ignore-revs-file <file>) are available in git since 2.23: https://git-scm.com/docs/git-blame#Documentation/git-blame.txt---ignore-revltrevgt.

In my experience, neither deals really well with formatting changes, especially when multiple lines are folded into one.

Marco Castelluccio
2

git filter-branch --tree-filter "find < dir > -regex '.*\.\(cpp\|h\|c\|< etc >\)' -exec < formatter-command > {} \;" -- --all

< dir > : the directory to format. The command above needs to be run from the root dir, but you may want to format only a certain sub-dir under the root git dir.

< etc > : other file formats.

< formatter-command > : the command which you can run for a single file and it would format that file.

--all at the end means to do this for all git branches (overall 4 dashes)

E.g. this is what I have, wherein my git contains src directory (apart from tests, tools, etc)

git filter-branch --tree-filter "find src -regex '.*\.\(cpp\|h\|cu\|inl\)' -exec clang-format -style=google -i {} \;" -- --all

The above will rewrite each git commit, but will not change the git annotation. Since this modifies git history, everyone will have to re-clone once this is pushed.

Rishabh Agarwal
1

There must be a way to format the codebase while preserving the author information for each line.

One thing you could do is to branch from some earlier commit, reformat the code, and then rebase master to your branch. That would preserve authorship for all the changes that came after whatever commit you start from.

So that's the idea, but there are some big reasons that you shouldn't do it:

  1. Rebasing a shared branch is a bad idea. The fact that you even care about preserving the authorship of changes probably means that there are a number of people actively working on the code. If you go and rebase the master branch, then every fork or clone of your repo is going to have a master branch with the old history, and that's bound to cause confusion and pain unless you're very careful about managing the process and making certain that everybody is aware of what you're doing and updates their copies appropriately. A better approach would probably be to not rebase master, but instead merge the commits from master into your branch. Then, have everybody start using the new branch instead of master.

  2. Merge conflicts. In reformatting the entire codebase, you're probably going to make changes to a large number of lines in almost every file. When you merge the subsequent commits, whether that's via rebase or merge, you'll likely have a large number of conflicts to resolve. If you take the approach I suggested above and merge commits from master into your new branch instead of rebasing, then it'll be easier to resolve those conflicts in an orderly way because you can merge a few commits at a time until you're caught up.

  3. Incomplete solution. You're going to have to figure out where in the history you want to insert your reformatting operation. The farther back you go, the more you'll preserve the authorship of changes, but the more work it'll be to merge in the subsequent changes. So you'll probably still end up with lots of code where your reformatting commit is the latest change.

  4. Limited benefit. You never actually lose authorship information in git -- it's just that tools typically only show who made the most recent change. But you can still go back and look at prior commits and dig through the entire history of any piece of code, including who made it. So the only thing that inserting your reformatting operation into the history really buys you is the convenience of seeing who changed some piece of code without the extra step of going back to an earlier commit.

  5. It's dishonest. When you rewrite the history of a branch, you're changing a factual recording of how the code changed over time, and that can create real problems. Let's imagine that your reformatting isn't quite as inconsequential as you mean it to be, and in doing the reformatting you actually create a bug. Let's say, for example, that you introduce some extra white space into a multi-line string constant. Weeks later, somebody finally notices the problem and goes looking for the cause, and it looks like the change was made a year and a half ago (because that's where you inserted your reformatting into the history). But the problem seems new -- it doesn't show up in the build that shipped two months ago, so what the heck is going on?

  6. Benefit diminishes over time. As development continues, the changes that you're trying so hard not to cover up will be covered up by some other changes anyway, and your reformatting changes would likewise be superseded by those new changes. As time and development march on, the work you do to bury your reformatting changes won't mean much.

If you don't want your name showing up as the author of every line in your project, but you also don't want to live with the problems described above, then you might want to rethink your approach. A better solution might be to tackle the reformatting as a team: get everyone on the team to agree to run the formatter on any file that they change, and make proper formatting a requirement in all code reviews going forward. Over time, your team will cover most of the code, and the authorship information will be mostly appropriate since every file that gets reformatted was going to be changed anyway. You may eventually end up with a small number of files that never get reformatted because they're very stable and don't need updates, and you can choose to reformat them (because having some badly formatted files makes you nuts) or not (because nobody is really working in those files anyway).
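One way to support the "format only what you change" rule is to ask git which files differ from a base commit and hand only those to the formatter. The sketch below assumes a throwaway repository with hypothetical file names; the formatter call is left as a comment because it depends on your tooling.

```shell
#!/bin/sh
set -eu

# Throwaway repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "dev@example.com"
git config user.name "Dev"

printf 'a\n' > stable.js
printf 'b\n' > busy.js
git add .
git commit -q -m "Initial"
base=$(git rev-parse HEAD)

# A later change touches only busy.js.
printf 'b2\n' > busy.js
git commit -q -am "Change busy.js"

# Added, copied, modified, or renamed JS files since the base commit.
changed_files=$(git diff --name-only --diff-filter=ACMR "$base" HEAD -- '*.js')
echo "$changed_files"

# Hypothetical next step: pass only these files to the formatter, e.g.
#   echo "$changed_files" | xargs npx prettier --write
```

In a real project the base would typically be the branch point from the main branch rather than a hash captured in a variable.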

Caleb