
I recently learned about git blame and what it does. I want to know how git finds when each line was changed in a file, even across file renames. In other words, I want to know how the blame algorithm works.

Rak Laptudirm

1 Answer


First of all, the blame feature exists in almost all other SCM tools too, including CVS, so the exact algorithm varies with the tool you're using.

Basically, however, the simplest way to achieve this is to start from the most recent state of your file, then walk the history backwards (toward the past), applying the reverse of each changeset.

Every row touched by the latest changeset is marked as belonging to the last commit (n); all other rows are provisionally assigned to the previous one (n-1), and you keep a count of these provisionally assigned rows. Then you repeat the process between commits n-1 and n-2. Rows that aren't currently assigned to n-1 are skipped, because a more recent commit has already altered them (the reverse changeset is still applied to them so the file content stays in sync, but their commit number isn't updated). For the remaining rows you apply the same computation, updating the commit each row belongs to.

You then just iterate on this all the way down to the initial commit if needed. If the row count mentioned above reaches zero, though, you can stop right there: it means every row has been altered since the original state of the file, so there's no need to go any further back.
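To make the backward walk above more concrete, here is a minimal Python sketch. It is not Git's actual implementation: it assumes a purely linear history passed in as a list of full file snapshots, uses the standard library's `difflib` in place of a real diff engine, and ignores renames and merges. The function name `blame` and the `(commit_id, lines)` input shape are made up for the illustration.

```python
from difflib import SequenceMatcher


def blame(revisions):
    """Attribute each line of the newest revision to a commit.

    `revisions` is a hypothetical input: a list of (commit_id, lines)
    tuples ordered oldest to newest, one full snapshot per commit.
    """
    _, newest_lines = revisions[-1]
    result = [None] * len(newest_lines)
    # Maps a line index in the revision being examined to the index of
    # the corresponding line in the newest revision. These are the rows
    # that have not been attributed to a commit yet.
    pending = {i: i for i in range(len(newest_lines))}

    for k in range(len(revisions) - 1, 0, -1):
        if not pending:
            break  # every row is attributed, no need to go further back
        commit_id, lines = revisions[k]
        _, parent_lines = revisions[k - 1]
        matcher = SequenceMatcher(None, parent_lines, lines, autojunk=False)
        carried = {}
        for a, b, size in matcher.get_matching_blocks():
            for offset in range(size):
                if b + offset in pending:
                    # Unchanged row: pass it on to the parent revision.
                    carried[a + offset] = pending.pop(b + offset)
        # Rows left in `pending` were added or modified by this commit.
        for final_index in pending.values():
            result[final_index] = commit_id
        pending = carried

    # Rows that survived the whole walk date back to the oldest revision.
    for final_index in pending.values():
        result[final_index] = revisions[0][0]
    return result


history = [
    ("c1", ["a", "b", "c"]),
    ("c2", ["a", "x", "b", "c"]),
    ("c3", ["a", "x", "b", "C"]),
]
print(blame(history))
```

The `pending` dictionary plays the role of the row count from the explanation: once it is empty, the loop stops early.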

Obsidian
  • A question. The same line can appear in many places across a file. Also, a line can change its line number when other things are inserted or deleted. How can these changes be recorded while making sure it is not a different line? Basically, how does the "negative application" work? – Rak Laptudirm May 07 '21 at 11:21
  • Well, that's the whole point of an SCM tool, and all of this is generally done using the "diff" tool, which has existed for years on UNIX and is indeed really clever. Check out https://git-scm.com/docs/git-diff to see which options and different algorithms can be used. Doing this on a line-based file is one thing; doing the same for genomic analysis is a step further: it's said to be comparable to speech recognition. – Obsidian May 07 '21 at 11:31
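
As a rough illustration of how a diff keeps track of a line whose number shifts, or whose text is duplicated elsewhere in the file, here is a small Python sketch. It uses the standard library's `difflib.SequenceMatcher`, which is not the diff implementation Git uses; the helper `align` and the sample file contents are made up for the example.

```python
from difflib import SequenceMatcher

# The blank line appears twice, and every line after the insertion
# moves down by one in the new version.
old = ["import os", "", "print('hi')", "", "print('bye')"]
new = ["import os", "import sys", "", "print('hi')", "", "print('bye')"]


def align(old, new):
    """Map each unchanged line index in `new` to its index in `old`."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    mapping = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for offset in range(i2 - i1):
                mapping[j1 + offset] = i1 + offset
    return mapping


for new_index, old_index in sorted(align(old, new).items()):
    print(f"new line {new_index} <- old line {old_index}")
```

Lines are matched as part of the longest common runs rather than one by one, so each blank line is paired with the blank line sitting in the same surrounding context. New line 1 (`import sys`) has no counterpart in the old version, so it is the only one a blame pass would attribute to the newer commit.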