How to find all the "active" git commits in a tree?

Question

I'd like to get a snapshot of the "active" git commits for a directory tree, meaning commits that really are part of the build, not commits that have been fully superseded by newer commits.

I can do this by running git blame on every file and extracting the commits that way, but it's too slow to be practical on a large repo.

What's your purpose for doing this? Perhaps there's a better approach. — Schwern, Aug 03 '20 at 04:29
It’s to answer the question “where is my commit?” when you have a lot of branches and deployments of an app. I’m loading the commits into a data warehouse for cross referencing. — CCS, Aug 03 '20 at 04:33
What do you mean by "where is my commit"? Could this be solved by `git branch --contains`? — Schwern, Aug 03 '20 at 04:39
It seems to show if the commit is in the branch, but not if it’s been superseded by another commit. — CCS, Aug 03 '20 at 04:45
So is the question: for every file, tell me the last commit that changed it? — matt, Aug 03 '20 at 04:48
It’s more “tell me all the commits that make up every file”. So excluding commits that are in the branch but not relevant any more. — CCS, Aug 03 '20 at 04:50
@CCS What is the purpose of knowing if the lines of a commit are still unmodified in the code? What if they're trivially changed by a following commit, like a style change? — Schwern, Aug 03 '20 at 04:53
@Schwern It’s fine if they’re modified, but if it’s been fully replaced then I don’t need them in the list anymore. The reason is that if you have a change to your app, and there’s a problem that could be related to your change or you need to explain the behavior of an app running in a deployment, it doesn’t help knowing your change was in the history if it’s not actually compiled into the app anymore. — CCS, Aug 03 '20 at 04:59
@CCS Could finding the problematic commit be better solved with `git bisect`? — Schwern, Aug 03 '20 at 05:07
@Schwern `git bisect` I think is more for when you don't know what the commit is that's causing a behaviour. In this case we're saying we know the commit and we're trying to match it up with a behaviour. Also, the commit can be a fix, so we're trying to determine if the fix really worked. — CCS, Aug 03 '20 at 05:11
@CCS If you already suspect a commit introduced a problem, normally you run the commit just before; if the problem is still there, it wasn't that commit. `git bisect` automates this process without having to first guess the problematic commit. You could cache `git blame` for each file and only update the cache for the files touched by each commit, but I'm struggling to see its utility. How have you chosen the commit to match up with a behavior? — Schwern, Aug 03 '20 at 05:22
What does it mean for a commit to be "part of the build"? You say that you want to list the commits that are not "fully superseded by newer commits". What if only an empty line remains from a certain commit? Or only a variable declaration? I don't think the data you're looking for would actually be useful. Checking that the "important" part of a commit is still present is unfortunately a task that needs human judgement. I mean, even if 100% of your patch is intact, the surroundings could have changed in a way that breaks it; e.g. if the function isn't even called anymore. — Snild Dolkow, Aug 03 '20 at 05:27
@SnildDolkow Yes, we don't have a way to track logic, which is really the goal, but being able to eliminate commits that are completely irrelevant cuts down the human work and the storage required. — CCS, Aug 03 '20 at 06:02
@Schwern the commit is already known, so this is the point after you’ve fixed the bug or added a feature and are now trying to get your change into production. At some companies there can be a number of branches cut for testing and release and they can be behind master up to two weeks, just the reality at some companies. So making sure your commit is in all the right branches becomes non trivial. — CCS, Aug 03 '20 at 21:13

score 0 · Accepted Answer · answered Aug 03 '20 at 05:13

What git blame does is pretty much the only way to find the information you're looking for. However, you can simplify the procedure somewhat, and the simplified version might be both sufficient for your purposes and fast enough.

Remember, every commit has a full snapshot of every file. A branch name identifies the last commit in some chain of commits. So when you have:

... <-F <-G <-H   <-- branch

the name branch holds the raw hash ID of commit H. In commit H, there are many files, each of which has many lines. Those files are in the form they have in commit H, and that's all there is to it—except that commit H contains the hash ID of earlier commit G.

You can use this hash ID to locate commit G and extract all of its files. When a file in G completely matches the same file in H, then (in git blame terms at least) every line of that file is attributable to G, if not to some earlier commit. Files that differ between G and H should therefore be attributed to H. The git blame command works on a line-by-line basis, attributing individual lines to commit H if they differ, but perhaps for your purposes, attributing the entire file to H suffices.

Should you decide that the file should perhaps be attributed to commit G, it is now time to extract commit F's hash ID from commit G, and use that to read all the files from commit F. If any given file in F matches the copy in G, the attribution moves back to F; otherwise it remains at G.

You must repeat this process until you run entirely out of commits:

A <-B <-C ... <-H

Since commit A has no parent, any files in A that are unchanged all the way through the last commit are to be attributed to commit A. You can, however, stop traversing backwards as soon as you have completely attributed all files that exist in H to some commit later in the chain. Compare this to git blame, which must keep looking backwards as long as at least one line is attributed to some earlier commit: you'll probably stop long before git blame must.
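The walk described above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions that are not part of the answer: history is strictly linear, each commit's snapshot has already been loaded into a dict mapping path to blob hash ID (for instance from the output of `git ls-tree -r`), and the name `attribute_files` is made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: (commit_id, {path: blob_id}), newest commit first.
Snapshot = Tuple[str, Dict[str, str]]


def attribute_files(chain: List[Snapshot]) -> Dict[str, str]:
    """Attribute each file in the newest commit to the oldest commit in the
    unbroken run of commits, starting at the tip, that holds the same blob."""
    if not chain:
        return {}
    tip_id, tip_tree = chain[0]
    # Start by attributing every file to the tip, then move back while blobs match.
    attribution = {path: tip_id for path in tip_tree}
    pending = set(tip_tree)  # files whose attribution may still move backwards
    for commit_id, tree in chain[1:]:
        if not pending:  # early stop: every file is already settled
            break
        for path in list(pending):
            if tree.get(path) == tip_tree[path]:
                attribution[path] = commit_id  # identical content, keep walking
            else:
                pending.discard(path)  # changed or absent here: settled
    return attribution
```

With a result in hand, `set(attribute_files(chain).values())` would give the "active" commits at whole-file granularity, which is coarser than what git blame reports.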

Moreover, because of Git's internal data structures, it is very fast to tell whether a file in some earlier commit exactly matches a file of the same name in some later one: every file in every commit is represented by a hash ID. If the hash ID is the same, the file's contents are bit-for-bit identical in the two commits. If not, they're not.
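To illustrate why the comparison is cheap, here is a small sketch of how a blob's hash ID is derived from file contents. It assumes a repository using the default SHA-1 object format (a SHA-256 repository would hash differently), and the function names are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git gives a file's contents in a SHA-1 repository:
    the SHA-1 of the header "blob <size>\\0" followed by the raw bytes."""
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def same_file(blob_id_a: str, blob_id_b: str) -> bool:
    """Two copies of a file are identical exactly when their blob IDs match,
    so no file contents need to be read or diffed."""
    return blob_id_a == blob_id_b
```

In practice you would not recompute these IDs yourself; Git already stores one per file in each commit's tree, so comparing two commits' copies of a file is a string comparison.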

There is no convenient in-Git command to do exactly what you want,¹ and if you do intend to traverse the history like this, you must decide what to do with merges. Remember that a merge commit has a snapshot, but unlike a non-merge, has two or more parents:

...--o--K
         \
          M--o--o--...--o   <-- last
         /
...--o--L

Which commit(s) should you follow, if the file in M matches the copy in K and/or the copy in L? The git log command has its own method of doing this: git log <start-point> -- <path> will simplify history by following just one of the parents that has the same hash ID for the given file. When several parents qualify, the choice among them is effectively arbitrary.
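One possible policy for merges, sketched below, is to follow the first parent (in the order the merge records them) whose copy of the file has the same blob ID. This is an assumption chosen for illustration, not something the answer prescribes; the function name and the `trees` structure are hypothetical.

```python
from typing import Dict, List, Optional


def parent_to_follow(
    path: str,
    blob_id: str,
    parents: List[str],
    trees: Dict[str, Dict[str, str]],
) -> Optional[str]:
    """Return the first parent whose copy of `path` has the same blob ID,
    or None if the file differs from (or is absent in) every parent."""
    for parent in parents:
        if trees[parent].get(path) == blob_id:
            return parent
    return None
```

A result of `None` means the merge commit itself changed the file relative to all of its parents, so under this policy the file would be attributed to the merge.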

Note that you can use git rev-list, perhaps with --parents, to produce the set of hash IDs that you can choose to examine. The rev-list command is the workhorse for most other Git commands, including git blame itself, for following history like this. (Note: the git log command is built from the same source as git rev-list, with some minor command-line-option differences and different default outputs.)
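As a rough sketch of how that output could be consumed: `git rev-list --parents <rev>` prints one line per commit, with the commit's hash ID followed by the hash IDs of its parents. The helper names below are hypothetical, and `read_parents` assumes a `git` executable on the PATH.

```python
import subprocess
from typing import Dict, List


def parse_rev_list_parents(output: str) -> Dict[str, List[str]]:
    """Turn `git rev-list --parents` output into {commit: [parent, ...]}.
    A root commit maps to an empty list; a merge maps to two or more parents."""
    parents: Dict[str, List[str]] = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        parents[fields[0]] = fields[1:]
    return parents


def read_parents(repo: str, rev: str) -> Dict[str, List[str]]:
    """Run git in `repo` and parse the commit graph reachable from `rev`."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--parents", rev],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_rev_list_parents(result.stdout)
```

The resulting mapping keeps the order in which git printed the commits, so a traversal can start at the first key and follow parent links from there.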

¹While git log <start-point> -- <path> is useful here, it will be too slow to run this once for each path, and it's not effective to run it without giving individual paths.

This is really helpful, thank you. So if I'm following, the optimization here is to traverse the history by commit/file rather than file/line which is what `git blame` does? — CCS, Aug 03 '20 at 05:47
@CCS: yes, that's the general idea. You'll have to decide whether that really fits your desired outcome, but I think it would. — torek, Aug 03 '20 at 16:00
