27

I am working on a program written by several people with widely varying skill levels. There are files in there that have never changed (and probably never will, as we're afraid to touch them) and others that change constantly.

I wonder, are there any tools out there that would look at the entire repo history (git) and produce analysis on how frequently a given file changes? Or package? Or project?

It would be valuable to recognize that, for example, we spent 25% of our time working on one set of packages, which would indicate that this code is fragile compared with code that "just works".

James Raitsev

5 Answers

11

If you're looking for an open-source solution, I'd probably start with gitstats and extend it by grabbing the per-file logs and aggregating that data.
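As a rough illustration of what "aggregating that data" could look like without touching gitstats itself, here is a minimal sketch in plain shell. The helper name `churn_by_dir` is made up for this example, and it measures the share of file changes per top-level directory, which is only a proxy for time spent.

```shell
#!/bin/sh
# Hypothetical helper (not part of gitstats): reads one changed path per line
# on stdin and prints "<share>\t<changes>\t<top-level directory>", busiest first.
churn_by_dir() {
  awk -F/ '
    NF == 0 { next }                                  # skip blank separator lines
    { dir = (NF > 1) ? $1 : "."; n[dir]++; total++ }  # files in the repo root count as "."
    END { for (d in n) printf "%.1f%%\t%d\t%s\n", 100 * n[d] / total, n[d], d }
  ' | sort -t "$(printf '\t')" -k2,2nr -k3,3
}
```

Inside a repository you would feed it the list of changed files, for example `git log --name-only --pretty=format: | churn_by_dir`.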

Dave Newton
  • 1
    I especially appreciated gitstats' `merge_authors` feature, which enables cleaning up where the same person has committed under different author names. cf https://gitorious.org/gitstats/mainline/commit/005fe0bbcab967367e4932d11b161f9f0f71cf7f – Noah Sussman Oct 28 '13 at 20:39
8

I'd have a look at NChurn:

NChurn is a utility that helps assess the churn level of the files in your repository. Churn can help you detect which files are changed the most in their lifetime. This helps identify potential bug hives and improper design. The best thing to do is to plug NChurn into your build process and store the history of each run. Then you can plot the evolution of your repository's churn.
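To get a feel for the kind of number a churn tool reports, a per-file commit count can be approximated with plain git. This is only a sketch and not how NChurn is implemented; the function name `churn_in_range` is made up, and any extra arguments are passed straight to `git log`.

```shell
#!/bin/sh
# Hypothetical helper (not NChurn): prints "<commits> <path>" for every file
# touched by the selected commits, most-changed first.
# Extra arguments (e.g. --since / --until) are forwarded to git log.
churn_in_range() {
  git log --name-only --pretty=format: "$@" |
    grep -v '^$' |   # drop the blank lines between commits
    sort | uniq -c | # count how often each path appears
    sort -rn
}
```

For example, `churn_in_range --since="2016-01-01" --until="2016-06-30"` limits the count to commits in that date range.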

Henrik
  • NChurn works well - and it runs fast. It counts the number of checkins per file for a date range in the repo. (It needs a trivial NPE fix, or be sure to include an "exclude" list). – Dave C Aug 04 '16 at 13:59
6

I wrote something that we use to visualize this information successfully.

https://github.com/bcarlso/defect-density-heatmap

Take a look at the project and you can see what the output looks like in the readme.

You can do what you need by first getting a list of files that have changed in each commit from Git.

~ $ git log --pretty="format:" --name-only | grep -v '^$' > file-changes.txt

~ $ for i in $(cut -d"." -f1,2 file-changes.txt | sort -u); do num=$(grep -c -F -- "$i" file-changes.txt); if (( num > 1 )); then echo "$num,0,$i"; fi; done | heatmap > results.html

This will give you a tag cloud in which files that churn more show up larger.

bcarlso
  • 2
    The second bit doesn't really scale well. `sort file-changes.txt |uniq -c|sed -e 's/^ *//' -e 's/ /,0,/' >heatmap.in` or something to that effect should be faster. – cdegroot Sep 17 '15 at 21:03
5

I suggest using a command like

git log --follow -p file

That will give you all the changes made to the file throughout its history (including renames). If you want the number of commits that changed the file, you can run this on a UNIX-based OS:

git log --follow --format=oneline Gemfile | wc -l

You can then write a bash script that applies this to multiple files and prints each file's name next to its count.
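One possible shape for such a script, as a sketch: it walks the files git currently tracks (via `git ls-files`) and runs the same `--follow` count on each. The function name `churn_follow` is made up for this example, and it assumes you run it from the repository root.

```shell
#!/bin/sh
# Hypothetical helper: prints "<commits>\t<path>" for every tracked file,
# most-changed first. Run it from the root of the repository.
churn_follow() {
  git ls-files | while IFS= read -r f; do
    # --follow keeps counting across renames; tr strips wc's padding
    n=$(git log --follow --format=oneline -- "$f" | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$n" "$f"
  done | sort -nr
}
```

Because it runs one `git log` per file, expect it to be slow on large repositories.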

Hope it helps!

Cydonia7
2

Building on a previous answer, I suggest the following script to process all project files:

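#!/bin/sh
# Usage: file_churn.sh project_dir
# Prints "<commit count><TAB><path>" for every file, most-changed first.
cd "$1" || exit 1
find . -path ./.git -prune -o -type f -exec sh -c 'n=$(git log --follow --format=oneline -- "$1" | wc -l | tr -d " "); printf "%s\t%s\n" "$n" "$1"' sh {} \; | sort -nr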

If you save the script as file_churn.sh, you can analyze your git project directory by calling:

> ./file_churn.sh project_dir

Hope it helps.

aolchik