I'm contributing to a package (Python >= 3.5) that uses git blame
to retrieve information on files. I'm working on replacing the GitPython dependency with custom code that supports just the small subset of functionality we actually need and provides the data in the form we actually need.
I found that git blame -lts
came closest to what I need, namely retrieving the commit SHA and line content for each line in a file. This gives me output like
82a3e5021b7131e31fc5b110194a77ebee907955 books/main/docs/index.md 5) Softwareplattform [ILIAS](https://www.ilias.de/), die an zahlreichen
I've processed this with
import re

# Lazily capture the SHA, skip file name and line number, then capture the content
line_pattern = re.compile(r'(.*?)\s.*\s*\d\)(\s*.*)')
for line in cmd.stdout():
    m = line_pattern.match(line)
    if m:
        sha = m.group(1)
        content = m.group(2).strip()
which works well. However, that package's maintainer correctly warned that "This might introduce hard to debug errors for very specific groups of users. Probably needs to be heavily unit tested, across multiple OS and GIT versions."
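One case that might be behind the maintainer's warning: the greedy `.*` in the pattern backtracks to the last "digit followed by `)`" on the line, so source lines that themselves contain something like `f(1)` may get truncated. The sketch below checks this; the `parse` helper and the second blame line (`src/app.py`) are made up for illustration.

```python
import re

# Same pattern as above, written as a raw string
line_pattern = re.compile(r'(.*?)\s.*\s*\d\)(\s*.*)')


def parse(line):
    """Return (sha, content) for one `git blame -lts` line, or None."""
    m = line_pattern.match(line)
    return (m.group(1), m.group(2).strip()) if m else None


sha = "82a3e5021b7131e31fc5b110194a77ebee907955"

# The sample line from above parses as intended
ok = parse(sha + " books/main/docs/index.md 5) Softwareplattform "
           "[ILIAS](https://www.ilias.de/), die an zahlreichen")

# Hypothetical line whose content contains a digit followed by ")"
tricky = parse(sha + " src/app.py 7) total = f(1) + 2")

print(ok)
print(tricky)  # content comes back as '+ 2', not 'total = f(1) + 2'
```

So the regex is fine for prose-like content but probably not for arbitrary source files.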
I arrived at this approach because I found the output of git blame --porcelain
somewhat tedious to parse:
30ed8daf1c48e4a7302de23b6ed262ab13122d31 1 1 1
author XY
author-mail <XY>
author-time 1580742131
author-tz +0100
committer XY
committer-mail <XY>
committer-time 1580742131
committer-tz +0100
summary Stub-Outline-Dateien
filename home/docs/README.md
hero: abcdefghijklmnopqrstuvwxyz
82a3e5021b7131e31fc5b110194a77ebee907955 18 18
82a3e5021b7131e31fc5b110194a77ebee907955 19 19
---
82a3e5021b7131e31fc5b110194a77ebee907955 20 20
...
I don't like the housekeeping involved in that kind of iteration over lists of strings.
My questions are:
1) Should I rather use the --porcelain
output, since it is explicitly intended for machine consumption?
2) Can I expect this format to be robust across Git versions and OSes? Can I rely on the assumption that a line starting with a TAB character is the content line, that this is the final line of output for a source line, and that anything after that TAB is the original line content?
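For reference, this is roughly the parser I have in mind if the assumptions in 2) hold. It is only a minimal sketch: `parse_porcelain` is a name I made up, it takes already-decoded text lines, and the sample input is an abridged version of the output above with the leading TABs written out explicitly.

```python
def parse_porcelain(lines):
    """Yield (sha, content) pairs from `git blame --porcelain` output lines.

    Assumes each source line is reported as a header line starting with the
    commit SHA, optional commit-info lines, and finally one TAB-prefixed
    content line.
    """
    sha = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # e.g. a trailing empty string left over from splitting
        if line.startswith("\t"):
            # Content line: everything after the single leading TAB
            yield sha, line[1:]
            sha = None  # the next non-TAB line starts a new entry
        elif sha is None:
            # Header line: "<sha> <orig-line> <final-line> [<group-size>]"
            sha = line.split(" ", 1)[0]
        # Anything else is a commit-info line (author, summary, filename, ...)


sample = (
    "30ed8daf1c48e4a7302de23b6ed262ab13122d31 1 1 1\n"
    "author XY\n"
    "summary Stub-Outline-Dateien\n"
    "filename home/docs/README.md\n"
    "\thero: abcdefghijklmnopqrstuvwxyz\n"
    "82a3e5021b7131e31fc5b110194a77ebee907955 20 20\n"
    "\t---\n"
)

result = list(parse_porcelain(sample.splitlines()))
for sha, content in result:
    print(sha[:8], repr(content))
```

The only state to track is whether a header has been seen since the last content line, which keeps the housekeeping fairly small, but it all depends on the TAB assumption being reliable.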