
I'm working on a program that analyzes git blame history over time, starting with the first commit of a given file all the way to HEAD, along a given branch.

Currently, the way I'm doing it is:

  1. Use `git log --pretty='%H %ad' --date=unix <branch>` to get a list of every commit on the branch.
  2. For each commit in that list, individually, use `git blame --date=unix --minimal -l -e -w <commit> <filename>` and parse the results.
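
For reference, the two steps above can be sketched roughly as follows. This is only an illustration in Python (the actual application is in Node.js); the helper names `parse_log`, `blame_command`, and `blame_history` are hypothetical, and the `--` separator before the filename is an addition for disambiguation, not part of the original command.

```python
import subprocess
from typing import Dict, List, Tuple


def parse_log(output: str) -> List[Tuple[str, int]]:
    """Parse `git log --pretty='%H %ad' --date=unix` output into (hash, unix_time) pairs."""
    commits = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        commit_hash, timestamp = line.split(" ", 1)
        commits.append((commit_hash, int(timestamp)))
    return commits


def blame_command(commit: str, filename: str) -> List[str]:
    """Build the per-commit blame command from step 2 (hypothetical helper)."""
    return ["git", "blame", "--date=unix", "--minimal", "-l", "-e", "-w",
            commit, "--", filename]


def blame_history(branch: str, filename: str, cwd: str = ".") -> Dict[str, str]:
    """Run one `git blame` per commit on the branch: the slow approach described above."""
    log = subprocess.run(
        ["git", "log", "--pretty=%H %ad", "--date=unix", branch],
        cwd=cwd, capture_output=True, text=True, check=True,
    ).stdout
    history = {}
    for commit, _timestamp in parse_log(log):
        result = subprocess.run(blame_command(commit, filename),
                                cwd=cwd, capture_output=True, text=True)
        # The file may not exist at this commit; skip failures instead of raising.
        if result.returncode == 0:
            history[commit] = result.stdout
    return history
```

Every iteration of the loop in `blame_history` spawns a new git process, which is where most of the time goes.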

The problem is that this takes a long time. Plus, I'm actually doing this for every file in a repo, over multiple repos, so the worst case for a given repo is something like O(number_of_files * number_of_commits) git invocations. A lot of the time is taken up by spawning git processes. For a tiny repo with a few dozen files and a few hundred commits, it takes almost 3 minutes (to run git about 16,000 times), and it's already fully parallelized.
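
As a rough back-of-the-envelope check of those numbers: the 40 files and 400 commits below are hypothetical values picked only because they multiply out to about 16,000, and 180 seconds is the upper bound implied by "almost 3 minutes".

```python
def invocations(number_of_files: int, number_of_commits: int) -> int:
    """Worst case: one `git blame` process per (file, commit) pair."""
    return number_of_files * number_of_commits


# Hypothetical repo size consistent with "a few dozen files, a few hundred commits".
total = invocations(40, 400)

# Effective wall-clock cost per invocation, with parallelism already factored in.
seconds_per_call = 180 / total

print(total, round(seconds_per_call * 1000, 2), "ms per call")
```

With one git invocation per file instead, the same hypothetical repo would need 40 processes rather than 16,000.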

My question is: is there a way to get the complete blame history (e.g. if a line was changed multiple times in multiple commits) for every change to a given file (still one file at a time, though) in a single git command, so that I can reduce the amount of time this takes? I'd like to reduce it to O(number_of_files). This is my first optimization target; I just haven't been able to figure out whether there's a way to do it yet.

I looked at the output from `git blame --incremental` but, unless I'm misreading (I didn't do a proper comparison so I might be wrong here), it still only gives the blames for the most recent changes, not every change at once.

Is it possible to do this and, if so, how?

Jason C
  • First, sometimes git is not the right answer. Consider using libgit2 instead (so you can avoid spawning process after process). Second, I wonder why you run blame on each commit. I would need to try it myself to get all the details right, _but_ it sounds to me like you should get the _diff_ against the parent for each commit (what the older commits did to the file should show up when processing each of those commits). Again: maybe I am missing something (like: how do you deal with non-linear history?). – eftshift0 Dec 10 '22 at 05:10
  • I think it would be interesting to try processing something like `git log --graph --patch -- the-file`, so you can get all changes in a single shot. Perhaps more options would make it easier to parse, and I wonder if `--graph` would let you see the ancestor (in terms of this file) for non-linear history. – eftshift0 Dec 10 '22 at 05:28
  • "...program that analyzes git blame history over time, starting with the first commit of a given file all the way to HEAD, along a given branch..." 1. Please explain: why? I can't imagine any reasonable, useful **business-goal** for this task. 2. If you want commits only for "a given branch", you have to at least use `git log `. 3. a) Blame for *forward-walking* history is slightly more than completely useless: it contains less information than log. b) Blame for *backward-walking* history is a duplicate of log (not in output format, but in information), with possible gaps compared to log. – Lazy Badger Dec 10 '22 at 10:18
  • @eftshift0 - `git log --graph` definitely **isn't** "easier to parse" – Lazy Badger Dec 10 '22 at 10:22
  • This is definitely a difficult problem, which is one reason Git does not solve it yet. – torek Dec 10 '22 at 11:23
  • @LazyBadger I use `git log ` to get the commit hashes that I analyze; that's how it is constrained to a branch. – Jason C Dec 10 '22 at 16:03
  • @eftshift0 For non-linear history, it's OK, it's satisfactory to use the commit list that `git log` reports. Good suggestion re `libgit2`; this particular application is in Node.js, and bindings definitely seem to exist; I'll check it out. – Jason C Dec 10 '22 at 16:04
  • @eftshift0 As for `--graph`, that is an interesting idea. I'll play with that today and let you know how it goes. I don't mind a good parsing challenge, and I'm curious about the performance tradeoff. Also even without looking I'd bet there's probably a diff parser somewhere in npm already, heh. Thanks. – Jason C Dec 10 '22 at 16:05
  • @eftshift0 So I switched over to libgit2 (via nodegit), and surprisingly, while it's blazing fast for getting the commit and file list, it's almost 4x slower for getting the blames than just using `git` directly. I was not expecting that kind of poor performance based on the success with the commit lists and file lists, was really expecting it to blow through all the blames. An unexpected disappointment. No idea why it's so slow. I suspect it might be doing a certain portion of the processing synchronously compared to the parallel performance of the separate `git` processes. – Jason C Dec 12 '22 at 22:36
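
The `git log --patch` idea from the comments above (one git invocation per file instead of one per commit) could be sketched along these lines. This is a minimal Python illustration under stated assumptions, not a tested solution: the `COMMIT ` marker prefix and all function names are hypothetical, `--graph` is left out to keep parsing simple, and the sketch only counts changed lines per commit. Reconstructing per-line blame would additionally require replaying the hunks in order.

```python
from typing import Dict, List, Tuple

MARKER = "COMMIT "  # hypothetical prefix; no diff line starts with this text


def log_patch_command(branch: str, filename: str) -> List[str]:
    """One command that emits every patch touching the file, oldest commit first."""
    return ["git", "log", "--reverse", "--date=unix",
            f"--pretty=format:{MARKER}%H %ad", "--patch",
            branch, "--", filename]


def split_patches(output: str) -> Dict[str, List[str]]:
    """Group the combined output into patch lines keyed by commit hash."""
    patches: Dict[str, List[str]] = {}
    current = None
    for line in output.splitlines():
        if line.startswith(MARKER):
            commit_hash = line[len(MARKER):].split(" ", 1)[0]
            current = patches.setdefault(commit_hash, [])
        elif current is not None:
            current.append(line)
    return patches


def count_changes(patch_lines: List[str]) -> Tuple[int, int]:
    """Count (added, removed) lines, looking only inside hunks."""
    added = removed = 0
    in_hunk = False
    for line in patch_lines:
        if line.startswith("diff --git"):
            in_hunk = False  # file header lines (---, +++) follow
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return added, removed


# Hand-written sample output for illustration only (not from a real repo).
SAMPLE = """COMMIT aaa 100

diff --git a/f.txt b/f.txt
new file mode 100644
--- /dev/null
+++ b/f.txt
@@ -0,0 +1,2 @@
+one
+two
COMMIT bbb 200

diff --git a/f.txt b/f.txt
--- a/f.txt
+++ b/f.txt
@@ -1,2 +1,2 @@
 one
-two
+2
"""

summary = {h: count_changes(p) for h, p in split_patches(SAMPLE).items()}
```

The output of `log_patch_command` would be passed to `split_patches` after running it once per file, which brings the number of git processes down to O(number_of_files).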

0 Answers