In a git project containing an example file named file.txt, I'd like to have a script that:
- Parses the current whitespace-separated word (in the example, for the first iteration, this will be Enlargement), maybe by using a regex like \b[A-Za-z]+\b for word detection.
- Checks whether the word is at least 5 characters long. If not, keeps moving to the next word until this condition is satisfied. If satisfied, moves on to the next step.
- Checks the entire history of the project to find out who originally made the commit that introduced this word.
- If the author of that specific commit matches johndoe, removes the word under consideration from the file.
- Repeats the steps above until all words in the file have been parsed and the words originally introduced by that specific author have been pruned.
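To make the steps above concrete, here is a minimal Python sketch of one way they might be wired together. It is not a tested solution. The function names are hypothetical, and it rests on several assumptions: the author is matched against the commit's author name (%an), which may not be how johndoe appears in your history; git's pickaxe search (git log -S) is used to find the oldest commit that changed the number of occurrences of a word in the file; and every occurrence of a word is attributed to whoever introduced it first.

```python
import re
import subprocess

WORD_RE = re.compile(r"\b[A-Za-z]+\b")  # one run of ASCII letters
MIN_LEN = 5  # shorter words (a, the, then, ...) are never candidates


def candidate_words(text, min_len=MIN_LEN):
    """Return the distinct words of at least min_len letters, in first-seen order."""
    seen = {}
    for match in WORD_RE.finditer(text):
        word = match.group()
        if len(word) >= min_len:
            seen.setdefault(word, None)
    return list(seen)


def first_author(word, path):
    """Author name of the oldest commit that changed how often `word` occurs in `path`.

    Relies on git's pickaxe search (-S). Caveat: -S matches substrings, so a
    search for "Enlarge" would also hit commits that touched "Enlargement".
    """
    result = subprocess.run(
        ["git", "log", "--reverse", "--format=%an", "-S" + word, "--", path],
        capture_output=True, text=True, check=True,
    )
    authors = result.stdout.splitlines()
    return authors[0] if authors else None


def prune(text, words):
    """Delete every whole-word occurrence of the given words from text."""
    if not words:
        return text
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub("", text)


def prune_author(path, author):
    """Rewrite `path` in place without the words first introduced by `author`."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    doomed = [w for w in candidate_words(text) if first_author(w, path) == author]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(prune(text, doomed))
    return doomed
```

Run from inside the repository, a call such as prune_author("file.txt", "johndoe") would rewrite the file and return the list of removed words. Two limitations are worth keeping in mind: a word that both authors typed in different places is credited entirely to whoever used it first, and in a LaTeX file the regex also matches command names such as the "section" in \section, so some extra filtering would likely be needed.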
Treatment of Common Words:
It is important to ignore common keywords like a, an, the, of, for, if, then, but, else, not, any, or, nor. So, I propose a minimum length of 5 characters for a word to qualify for removal.
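As a quick sanity check of that threshold, the short Python snippet below confirms that none of the common words listed above reaches 5 characters. The names COMMON_WORDS and MIN_LEN are only illustrative.

```python
COMMON_WORDS = ["a", "an", "the", "of", "for", "if", "then",
                "but", "else", "not", "any", "or", "nor"]
MIN_LEN = 5  # proposed minimum length for a word to qualify for removal

# Longest common word in the list (first one found on a tie).
longest = max(COMMON_WORDS, key=len)

# Common words that would still qualify for removal under the length rule.
survivors = [w for w in COMMON_WORDS if len(w) >= MIN_LEN]

print(longest, len(longest))  # then 4
print(survivors)              # []
```

The length rule alone does not catch every common word, though: "which", "their" and "about" are all 5 letters long, so an explicit stop list may still be worth combining with it.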
Basically the idea is to eliminate or revert English-like contributions made by a particular author. How can this be done?
Post-Processing by latexdiff:
This question is for producing a diff report after removing contributions by the author. After pruning the text (i.e. after I get the answer to this question), I intend to use a standard, yet amazing, Perl script, latexdiff, that can detect these word removals (or indeed any other difference between the two LaTeX files) and produce a composite LaTeX file that compiles to a PDF, highlighting the removed words with red strikethroughs. All that I need to do is identify and remove the words originally introduced by the other author (i.e. my core question here). Therefore all sentences in the composite PDF shall remain coherent with no loss of meaning: the removed words will stay in the same location and simply carry red strikethrough marks.
Background and Context:
This is in an academic context. The git project is a LaTeX repo of a manuscript. I am in an authorship dispute with a co-author of a paper, which therefore did not get submitted to any journal. We are both PhD students. So that each of us can claim copyright in our own words, our PhD advisor has asked us to submit our respective claims on the words each of us introduced in the manuscript, so that we can reuse them in our theses and steer clear of plagiarism accusations. We both committed to the same repo, and now I am thinking of leveraging the power of git and shell, along with git-grep, sed, awk, perl or whatever, to help me claim, with integrity, the words I actually contributed. Your help will be much appreciated.