
I'm working on a project that uses OCR. After some operations I have two strings like these:

s1 = "This text is a test of"
s2 = "a test of the reading device"

I would like to know how I can remove the repeated words from the second string. My idea is to find the position of each repeated word in both lists. I tried this:

e1 = s1.split()
e2 = s2.split()

for i, item2 in enumerate(e2):
    if item2 in e1:
        print i, item2  # repeated word and its index in the second string
        print e1.index(item2)  # index of its first occurrence in the first string

Now I have the repeated words and their positions in the first and second lists. I need to compare them word by word to check that they appear in the same order, because the same word may appear two or more times in a string (future validation).

At the end I would like to have final strings like these:

ns2 = "the reading device"
sf = "This text is a test of the reading device"

I'm using Python 2.7 on Windows 7.

Alex Ortega
  • Documentation exists. Please use it. https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists –  Jan 11 '17 at 06:29
  • Use `e1.index(item2)` to find out where item2 is in e1 – Moberg Jan 11 '17 at 07:34

2 Answers


Here is another attempt:

from difflib import SequenceMatcher as sq
match = sq(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))

Result

print s1 + s2[match.b+match.size:]

This text is a test of the reading device
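Since the question asks for a word-by-word comparison, here is a minimal sketch of a word-level variant. It assumes the overlap always sits at the end of `s1` and the start of `s2`; the helper name `merge_overlap` is hypothetical and not part of `difflib`.

```python
def merge_overlap(s1, s2):
    """Join s1 and s2, dropping the longest run of words that ends s1 and starts s2."""
    w1, w2 = s1.split(), s2.split()
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(min(len(w1), len(w2)), 0, -1):
        if w1[-k:] == w2[:k]:
            return " ".join(w1 + w2[k:])
    # No shared boundary words: plain concatenation.
    return " ".join(w1 + w2)


s1 = "This text is a test of"
s2 = "a test of the reading device"
print(merge_overlap(s1, s2))  # This text is a test of the reading device
```

Because it compares whole words at the boundary only, a word that also appears elsewhere in either string is left untouched.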

Rahul K P
  • It works fine but what happens if I have something like [that](http://stackoverflow.com/questions/41624787/how-to-delete-invalid-characters-between-multiple-strings-in-python/41624839#41624839). I hope that you can help me! – Alex Ortega Jan 12 '17 at 23:49

Maybe this?
' '.join(s1.split(' ') + [y for y in s2.split(' ') if y not in s1.split(' ')])

I haven't tested it carefully, but this may be a good approach for this kind of requirement.
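One thing worth checking with a membership filter like this is what happens when a word from `s1` legitimately reappears later in `s2`. The example strings below are made up for illustration and are not from the question.

```python
s1 = "the cat sat on"
s2 = "sat on the mat"

# The membership filter drops every word of s2 that occurs anywhere in s1,
# including the second "the", which is not part of the overlap.
words1 = s1.split()
filtered = " ".join(words1 + [w for w in s2.split() if w not in words1])
print(filtered)  # the cat sat on mat
```

So the filter gives the expected result for the strings in the question, but it may remove too much when words repeat outside the overlapping part.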

Hou Lu