In a git project containing an example file named file.txt, I'd like to have a script that:

  1. Parses the current whitespace-separated word (in the example, for the first iteration, this will be Enlargement), perhaps using a regex like \b[A-Za-z]+\b for word detection.
  2. Checks whether the word is at least 5 characters long. If not, it keeps moving to the next word until this condition is satisfied. If satisfied, it moves to #3 below.
  3. Check entire history of the project to find out who originally made the commit that introduced this word.
  4. If author of that specific commit matches johndoe, then remove the word under consideration from the file.
  5. Repeat #1 -- #4 until all words in the file have been parsed and the words originally introduced by that author have been pruned.
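
The steps above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a tested solution: the names `first_author`, `prune`, `WORD_RE` and `MIN_LEN` are made up for illustration, and it relies on `git log -S`, which reports commits that change the number of occurrences of a string. That means it matches substrings rather than whole words, it decides per word rather than per occurrence, and text that was merely moved between files or retyped will be attributed to whoever committed the move.

```python
import re
import subprocess

WORD_RE = re.compile(r"\b[A-Za-z]+\b")  # step 1: word detection
MIN_LEN = 5                             # step 2: skip short, common words


def first_author(word, path):
    """Return the author name of the earliest commit reported by
    `git log -S<word>` for `path`, or None if there is no such commit.
    Assumption: the earliest pickaxe hit is treated as the 'introduction'."""
    result = subprocess.run(
        ["git", "log", "--reverse", "--format=%an", "-S" + word, "--", path],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def prune(text, target_author, lookup):
    """Remove every word of at least MIN_LEN letters whose first author,
    as reported by `lookup(word)`, equals `target_author` (steps 3 and 4)."""
    cache = {}  # one git query per distinct word

    def replace(match):
        word = match.group(0)
        if len(word) < MIN_LEN:
            return word
        if word not in cache:
            cache[word] = lookup(word)
        return "" if cache[word] == target_author else word

    return WORD_RE.sub(replace, text)
```

Run from inside the repository, a call along the lines of `prune(open("file.txt").read(), "johndoe", lambda w: first_author(w, "file.txt"))` would return the pruned text. The lookup is passed in as a function so the pruning logic can be tried out without a repository.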

Treatment of Common Words:

It is important to ignore common words like a, an, the, of, for, if, then, but, else, not, any, or, nor. So I propose a minimum length of 5 characters for a word to qualify for removal.

Basically the idea is to eliminate or revert English-like contributions made by a particular author. How can this be done?

Post-Processing by latexdiff:

The purpose of this question is to produce a diff report after removing the contributions of that author. After pruning the text (i.e. once I have an answer to this question), I intend to use latexdiff, a standard yet amazing Perl script that can detect these word removals (or indeed any other difference between the two LaTeX files) and output a composite PDF that highlights the removed words with red strikethroughs. All I need to do is identify and remove the words originally introduced by the other author (i.e. my core question here). All sentences in the composite PDF will therefore remain coherent with no loss of meaning: the removed words stay in the same location and simply have red strikethrough marks over them.

Background and Context:

This is in an academic context. The git project is a LaTeX repo of a manuscript. I am in an authorship dispute with a co-author of a paper, which therefore did not get submitted to any journal. We are both PhD students. So that each of us can claim copyright of our own words for use in our respective theses, our PhD advisor has asked us to submit our respective claims on the words each of us introduced in the manuscript, so that we can reuse them in our theses and steer clear of plagiarism accusations. We both committed to the same repo, and now I am thinking of leveraging the power of git and shell along with git-grep, sed, awk, perl or whatever to help me claim, with integrity, the words I actually contributed. Your help will be much appreciated.

1 Answer


I think it would be easier to do a full interactive rebase and drop all the other author's commits.

Be prepared to resolve a lot of merge conflicts, which will get harder and harder as you trudge through the resulting mess.

rubenvb
  • Interactivity is something I want to avoid because there are about 10 different files in the project which we both contributed to. Even if the results are sub-optimal we can take it. I want to just prove to my supervisor who wrote those "big" technical words. That's all. – Dr Krishnakumar Gopalakrishnan Jul 18 '18 at 12:20
  • Git does not know who wrote which words. Moving a sentence two sentences down will be seen as a removal and an addition. Does that make the commit author the one who wrote that word? In your planned scheme, it does, even though the commit author did not in fact add the word. I believe your approach to be broken. Also, what do big words have to do with anything at all? This whole thing sounds quite childish and useless. – rubenvb Jul 18 '18 at 12:22
  • I don't think you are correct. `git log -S'enlargement' --oneline -- introduction.tex` shows the list of commits that touch that case-sensitive word, i.e. `enlargement`. The earliest commit in the list identifies the committing author. I am already doing this manually, but I need to automate it. – Dr Krishnakumar Gopalakrishnan Jul 18 '18 at 12:26
  • Note that we only have to parse the words in the latest checked-out file (`HEAD`) and match the author of the earliest commit returned by the command `git log -S'enlargement' --oneline -- introduction.tex`. Can you convince me why this scheme is broken? All that remains is to run a standard word parser on the latest checked-out file and work out the actual incantations of `awk` or `sed` to delete the word if `author==johndoe`. Can you convince me why I am wrong, given that this scheme works for me manually? Please try it out yourself by making a couple of commits in a fresh repo. – Dr Krishnakumar Gopalakrishnan Jul 18 '18 at 12:36
  • So the first author that wrote a word has sole claim on its use? How is that remotely related to how copyright/authorship works? I'm not going to try this myself because I think this endeavour is silly and does not represent nor solve the actual real problem. You both contributed to a paper, you're both authors, either publish it or don't, but the text in the manuscript belongs to both of you. Move on and do something useful. – rubenvb Jul 18 '18 at 12:39
  • Yes. That is how our librarian judges it. We are simply looking for the "big, technical words" that explain a concept. After all, a major chunk of a thesis is words, isn't it? Naturally, we want to reuse the words from the manuscript rather than getting stuck, and we have found a way forward like this. I mean, you could always argue and detract. But an imperfect solution is better than no solution, and we both wish to move on. If you could contribute along the lines I suggested, I'd humbly appreciate it. I find the negative votes discouraging, especially when I have a valid use case. I almost feel shamed. – Dr Krishnakumar Gopalakrishnan Jul 18 '18 at 12:41
  • How can both our theses contain the same words? We need to find a way to sort it out, don't we? Now that I have a starting point, I'd like to see how far I can pursue it in that direction. – Dr Krishnakumar Gopalakrishnan Jul 18 '18 at 12:45
  • "How can both your theses contain the same words"? Have you ever read more than one paper on a single subject? You'd be surprised how alike lots of parts are. Now I start wondering how you would have tackled this problem if that joint paper was published. Not that that would change anything in this whole situation though. If it really did, just push it onto arXiv and be done with it. – rubenvb Jul 18 '18 at 12:52
  • No. Your understanding is wrong. Theses are individual work whereas papers need not be. Two authors cannot jointly submit a thesis, whereas papers can by all means have co-authors. My co-author and supervisor do not want to make use of a pre-print service like arXiv at this stage, since it is not common in our field. A thesis is an independent piece of an individual's work. Of course, others may have helped with it and they get acknowledged, but the majority of the work and all of the reporting must be individual. If the paper had been published, we would both have to rewrite the sentences. – Dr Krishnakumar Gopalakrishnan Jul 18 '18 at 13:01