I want to find the words that have been inserted into a text file, using Python. For example:

Old: He is a new employee here.
New: He was a new, employee there.

I want this list of words as output: ['was', ',', 'there']

I tried difflib, but it reports the diff in a badly formatted way, using '+', '-' and '?' markers, so I would have to parse its output to find the new words. Is there an easy way to get this done in Python?

Hellboy

2 Answers

You can accomplish this with the re module.

import re

# compile a pattern that matches either a whole word or a comma
regex = re.compile(r'\w+|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# build lists of the words (or commas) in each sentence
old_words = regex.findall(old)
new_words = regex.findall(new)

# collect every word in new_words that has no counterpart left in old_words;
# each matched word is removed from old_words, so a word that already existed
# but now appears an extra time is still reported as added
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify
print(word_differences)

Note that if you want to pick up other punctuation, such as an exclamation mark or a semicolon, you must add it to the regular expression. Right now, it only matches words and commas.
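The loop above compares word counts only, so it ignores where in the sentence a word was added. As a rough sketch of a position-aware alternative, the same token lists can be handed to difflib.SequenceMatcher, which reports structured opcodes instead of the '+', '-' and '?' text. The function name inserted_tokens and the broader punctuation pattern below are assumptions made for illustration, not part of the code above.

```python
import re
from difflib import SequenceMatcher

# Assumed tokenizer: a run of word characters, or any single
# non-space punctuation character (broader than words-or-commas).
TOKEN = re.compile(r"\w+|[^\w\s]")


def inserted_tokens(old, new):
    """Return the tokens of `new` that SequenceMatcher does not align with `old`."""
    old_tokens = TOKEN.findall(old)
    new_tokens = TOKEN.findall(new)
    # autojunk=False turns off the popularity heuristic used for long sequences
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute tokens on the new side
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return added


old = "He is a new employee here."
new = "He was a new, employee there."
print(inserted_tokens(old, new))  # ['was', ',', 'there']
```

Because the comparison works on aligned positions, a word that already exists elsewhere in the old text should still be reported when it is inserted at a new spot, although the exact alignment is up to SequenceMatcher's matching heuristics.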

Jordan McQueen
  • But, if the old text contained the word "there" in some other place, would it return this word? – Hellboy Oct 29 '16 at 05:13
  • Ah yes, you're correct. The idea remains the same, but there's a simple fix for that degenerate case. I'll edit to accommodate. – Jordan McQueen Oct 29 '16 at 05:15

I used Google's diff-match-patch library. It works fine.

Hellboy