
I have a repository with more than 10,000 entries. I don't want to worry about renamed files. What would be the best approach to count the number of changes made to a file?

My idea was to iterate over all commits and compare the file's target.sha with the one in its parent commit. If the SHA is the same, the file was not changed. If the SHA is different, a file change was found, meaning this is a new version.

foreach (Commit c in repository.Commits)
{
    // DO THE WORK
}
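A minimal sketch of what that loop could look like, assuming LibGit2Sharp; the repository path and file path below are placeholders, and merge commits are simplified to their first parent:

```csharp
using System.Linq;
using LibGit2Sharp;

class FileChangeCounter
{
    static void Main()
    {
        // Placeholders -- substitute your own repository and file path.
        const string repoPath = "/path/to/repo";
        const string filePath = "src/Program.cs";

        int changeCount = 0;

        using (var repository = new Repository(repoPath))
        {
            foreach (Commit c in repository.Commits)
            {
                TreeEntry entry = c.Tree[filePath];

                // Simplification: only compare against the first parent
                // of merge commits.
                Commit parent = c.Parents.FirstOrDefault();

                if (parent == null)
                {
                    // Root commit: the file's first appearance counts.
                    if (entry != null)
                        changeCount++;
                    continue;
                }

                TreeEntry parentEntry = parent.Tree[filePath];

                string sha = entry == null ? null : entry.Target.Sha;
                string parentSha = parentEntry == null ? null : parentEntry.Target.Sha;

                // Different blob SHAs (including the file being added
                // or removed) count as one change.
                if (sha != parentSha)
                    changeCount++;
            }
        }

        System.Console.WriteLine(changeCount);
    }
}
```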

This takes some time, but it is the fastest approach I have found so far.

Maybe someone has a better idea?

JoeLiBuDa

2 Answers


The way you describe is basically as fast as you're going to get. What's left would be optimisations specific to your implementation of the solution, but without seeing your code, we cannot comment on that.

It could be worth comparing the trees that lead to the file instead of only the file itself, to save a few allocations and marshalling costs; but algorithmically you won't really do better than comparing the tree entries.
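A sketch of that tree-walking idea (names and structure are illustrative, assuming LibGit2Sharp): walk the path one component at a time and stop as soon as a subtree SHA matches on both sides, since an identical tree implies the file beneath it is identical too.

```csharp
using LibGit2Sharp;

static class TreeComparer
{
    // Returns true if `path` differs between `commit` and `parent`.
    // Bails out as early as possible by comparing subtree SHAs.
    public static bool FileChanged(Commit commit, Commit parent, string path)
    {
        Tree tree = commit.Tree;
        Tree parentTree = parent.Tree;

        foreach (string segment in path.Split('/'))
        {
            TreeEntry entry = tree[segment];
            TreeEntry parentEntry = parentTree[segment];

            if (entry == null && parentEntry == null)
                return false; // absent on both sides: no change
            if (entry == null || parentEntry == null)
                return true;  // added or removed along the path

            // Identical SHA here means the entire subtree (and thus
            // the file) is identical -- stop early.
            if (entry.Target.Sha == parentEntry.Target.Sha)
                return false;

            if (entry.TargetType != TreeEntryTargetType.Tree ||
                parentEntry.TargetType != TreeEntryTargetType.Tree)
                return true;  // reached a blob (or a type change) with
                              // differing SHAs: the file changed

            tree = (Tree)entry.Target;
            parentTree = (Tree)parentEntry.Target;
        }

        return true; // the path named a tree whose SHA differed
    }
}
```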

Carlos Martín Nieto

This would actually be your best bet. It's the same approach Git itself takes to solve the issue, so it would take a lot of work to make anything work better, faster, and as reliably. You could try using a faster hashing algorithm like MD5 if all you care about is counting the number of commits where changes are made.

NOTE: Theoretically you could encounter accuracy issues (collisions) with MD5, but only for incredibly large data sets; it should suffice for your needs.

gfish3000
  • What do you mean by using an MD5 algorithm with LibGit2Sharp? How would that work? I don't want to make changes to the library. – JoeLiBuDa Apr 14 '14 at 18:35
  • 2
    The contents are already hashed. This approach is comparing the commit IDs, which is the *already computed hash*, it is not recomputing the hash. That would be very expensive indeed! – Edward Thomson Apr 14 '14 at 18:53