Find new inserted words in text file

Question

I want to find the new words which are inserted into a text file using Python. For example:

Old: He is a new employee here.
New: He was a new, employee there.

I want this list of words as output: ['was', ',', 'there']

I used difflib, but it gives me the diff in a badly formatted way using '+', '-' and '?'. I would have to parse the output to find the new words. Is there an easy way to get this done in Python?

Jordan McQueen · Answer 1 · 2016-10-29T05:24:08.480

You can accomplish this with the re module.

import re

# create a regular expression object
regex = re.compile(r'\w+|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# creating lists of the words (or commas) in each sentence
old_words = re.findall(regex, old)
new_words = re.findall(regex, new)

# collect every word in new_words that has no unused match in old_words;
# removing each matched word also catches words that already existed but were added again
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify
print(word_differences)  # ['was', ',', 'there']

Note that if you want to detect other punctuation, such as an exclamation mark or a semicolon, you must add it to the regular expression. Right now, it only matches words and commas.
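
As a rough sketch of that extension, the pattern below treats any single non-space, non-word character as a token, so punctuation does not have to be listed by hand. The function name `inserted_tokens` and the `TOKEN` pattern are illustrative choices, not part of the answer above, and a `Counter` stands in for the list removal used there.

```python
import re
from collections import Counter

# Assumption: a token is a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")

def inserted_tokens(old, new):
    """Return the tokens of `new` that are not accounted for by tokens of `old`."""
    remaining = Counter(TOKEN.findall(old))  # unused occurrences left in the old text
    inserted = []
    for token in TOKEN.findall(new):
        if remaining[token] > 0:
            remaining[token] -= 1  # matched against one old occurrence
        else:
            inserted.append(token)  # no old occurrence left, so it is new
    return inserted

print(inserted_tokens("He is a new employee here.",
                      "He was a new, employee there."))
# ['was', ',', 'there']
```

Because the comparison only counts occurrences, it ignores word order: a word that moves to a different position is not reported.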

But, if the old text contained the word "there" in some other place, would it return this word? — Hellboy, Oct 29 '16 at 05:13
Ah yes, you're correct. The idea remains the same, but there's a simple fix for that degenerate case. I'll edit to accommodate. — Jordan McQueen, Oct 29 '16 at 05:15

score 0 · Answer 2 · answered Oct 29 '16 at 05:57

I used Google's diff-match-patch library. It works fine.
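
For comparison, a position-aware diff at word level can also be sketched with the standard library alone. This is not the library mentioned above: it swaps in `difflib.SequenceMatcher` applied to token lists, so the opcodes can be read directly and the '+', '-' and '?' text output never has to be parsed. The names `inserted_by_position` and `TOKEN` are hypothetical.

```python
import difflib
import re

# Assumption: a token is a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")

def inserted_by_position(old, new):
    """Return tokens that a sequence diff marks as inserted or replaced in `new`."""
    old_tokens = TOKEN.findall(old)
    new_tokens = TOKEN.findall(new)
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    inserted = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' both contribute tokens that exist only in `new`
        if tag in ("insert", "replace"):
            inserted.extend(new_tokens[j1:j2])
    return inserted

print(inserted_by_position("He is a new employee here.",
                           "He was a new, employee there."))
# ['was', ',', 'there']
```

Unlike the list-removal approach in Answer 1, this reports a word that merely moved to another position as inserted, which may or may not be what you want.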

Hellboy
