I have two texts, text A and text B. Text B isn't an exact copy of text A: it contains a lot of special characters that aren't in text A, but it is essentially the same text. I need to compare the two and map each word in text B to its counterpart in text A.
The text isn't in English, and can't easily be translated to English, so the examples below are just to demonstrate a few of the problems.
Some words in text A are not in text B, but all words in text B should be in text A:
text_a = "he experienced déjà vu"
text_b = ['he', 'experienced']
Some words in text B use different characters from text A, but are the same words:
text_a = "she owns & runs the cafe florae"
text_b = ['she', 'owns', 'and', 'runs', 'the', 'café', 'floræ']
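For what it's worth, Unicode normalisation deals with the accent cases but not the ligatures (a quick check; 'æ' has no NFKD decomposition, so it survives untouched):

```python
import unicodedata

def strip_accents(word):
    # NFKD splits an accented letter into base letter + combining
    # mark, which we then drop; ligatures like 'æ' don't decompose.
    decomposed = unicodedata.normalize('NFKD', word)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('café'))   # -> 'cafe'
print(strip_accents('floræ'))  # -> 'floræ' (unchanged)
```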
Words in text B are generally in the right order, but not always:
text_a = "an uneasy alliance"
text_b = ['uneasy', 'alliance', 'an']
Some words in text B are made up of smaller components which are also included in text B; these smaller components are redundant:
text_a = "we should withdraw our claim"
text_b = ['we', 'should', 'with', 'draw', 'withdraw', 'our', 'claim']
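These redundant fragments can be stripped with a pre-pass whenever the full word immediately follows them (a sketch that assumes exactly two adjacent fragments directly preceding the full word, which may not hold everywhere in my data):

```python
def drop_redundant_components(tokens):
    # Keep only the full word when two adjacent tokens concatenate
    # to the token that follows them ('with' + 'draw' -> 'withdraw').
    out, i = [], 0
    while i < len(tokens):
        if i + 2 < len(tokens) and tokens[i] + tokens[i + 1] == tokens[i + 2]:
            out.append(tokens[i + 2])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out

print(drop_redundant_components(
    ['we', 'should', 'with', 'draw', 'withdraw', 'our', 'claim']))
# -> ['we', 'should', 'withdraw', 'our', 'claim']
```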
Some words in text A are represented by two or more words in text B:
text_a = "they undercut their competitors"
text_b = ['they', 'under', 'cut', 'their', 'competitors']
What I want to do is replace the words in text A with their counterparts from text B. To do this I need to write a function to match words between the two texts.
I've tried writing a function that compares strings using the edit distance method from the nltk library, in conjunction with a handful of regular expressions. This only does an OK job, so I looked at sequence alignment techniques from libraries like biopython, but I can't get my head around them.
In particular, with edit distance it's very difficult to match words like 'under' and 'cut' to 'undercut' while also avoiding errors on short strings. In a sentence containing similar tokens such as 'to' and 'tu', both are the same edit distance from something like 'tú', so they're theoretically equally valid candidates, even though the obvious match here is 'tu', not 'to'.
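The tie is easy to reproduce with a plain dynamic-programming Levenshtein distance (equivalent, for this purpose, to nltk's edit_distance):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# 'to' and 'tu' are both one edit away from 'tú' -- edit distance
# alone cannot prefer the obvious candidate 'tu'.
print(levenshtein('to', 'tú'), levenshtein('tu', 'tú'))  # -> 1 1
```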
Is there any highly accurate way to match strings from text B in text A? I'd like to get an output like:
text_a = "the cafe florae undercut their competitors then withdrew their claim"
text_b = ['the', 'café', 'floræ', 'under', 'cut', 'their', 'competitors', 'then',
'with', 'drew', 'withdrew', 'their', 'claim']
match_list = some_matchfunc(text_a, text_b)
print(match_list)
[['the', 'the'], ['cafe', 'café'], ['florae', 'floræ'], ['undercut', 'under'],
['undercut', 'cut'], ['their', 'their'], ['competitors', 'competitors'], ['then', 'then'],
['withdrew', 'withdrew'], ['their', 'their'], ['claim', 'claim']]
Ideally, this would also include the start and end index of each matched word in text A, to avoid ambiguity with words like "their", which occurs twice:
[['the', [0, 3], 'the'], ['cafe', [4, 8], 'café'], ['florae', [9, 15], 'floræ'],
['undercut', [16, 24], 'under'], ['undercut', [16, 24], 'cut'], ['their', [25, 30], 'their'],
['competitors', [31, 42], 'competitors'], ['then', [43, 47], 'then'], ['withdrew', [48, 56], 'withdrew'],
['their', [57, 62], 'their'], ['claim', [63, 68], 'claim']]
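For concreteness, the naive baseline I have so far looks roughly like this, using only the standard library (difflib plus NFKD normalisation). It ignores the reordering case entirely, and the 0.7 similarity cutoff and the assumption that fragments always directly precede their full word are guesses on my part:

```python
import difflib
import unicodedata

def normalise(word):
    # Strip case and combining accents; ligatures like 'æ' have no
    # NFKD decomposition and would need a manual table in real data.
    decomposed = unicodedata.normalize('NFKD', word.lower())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def some_matchfunc(text_a, text_b):
    # Pre-pass: drop adjacent fragments whose concatenation is the
    # very next token ('with', 'drew', 'withdrew' -> 'withdrew').
    tokens, i = [], 0
    while i < len(text_b):
        if i + 2 < len(text_b) and text_b[i] + text_b[i + 1] == text_b[i + 2]:
            tokens.append(text_b[i + 2])
            i += 3
        else:
            tokens.append(text_b[i])
            i += 1

    # Character spans [start, end] for every word in text A.
    spans, pos = [], 0
    for word in text_a.split():
        start = text_a.index(word, pos)
        spans.append((word, [start, start + len(word)]))
        pos = start + len(word)

    # Greedy left-to-right alignment; does NOT handle reordering.
    matches, a_idx, b_idx = [], 0, 0
    while a_idx < len(spans) and b_idx < len(tokens):
        word_a, span = spans[a_idx]
        norm_a, norm_t = normalise(word_a), normalise(tokens[b_idx])
        if norm_a == norm_t:
            matches.append([word_a, span, tokens[b_idx]])
            a_idx, b_idx = a_idx + 1, b_idx + 1
        elif norm_a.startswith(norm_t):
            # Chain of fragments spelling one A word ('under' + 'cut').
            remainder = norm_a
            while (remainder and b_idx < len(tokens)
                   and remainder.startswith(normalise(tokens[b_idx]))):
                matches.append([word_a, span, tokens[b_idx]])
                remainder = remainder[len(normalise(tokens[b_idx])):]
                b_idx += 1
            a_idx += 1
        elif difflib.SequenceMatcher(None, norm_a, norm_t).ratio() > 0.7:
            # Loose fallback for pairs normalisation can't fix ('floræ').
            matches.append([word_a, span, tokens[b_idx]])
            a_idx, b_idx = a_idx + 1, b_idx + 1
        else:
            a_idx += 1  # word in A with no counterpart in B
    return matches
```

This happens to produce the output above for this example, but it's fragile: the greedy scan skips A words on any mismatch, and the ratio threshold will misfire on short, similar tokens, which is exactly the problem I'm trying to solve.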
As mentioned above, the text isn't in English, and translating it to compare words using NLP techniques isn't really feasible, so the matching does need to be based on string comparison. I assume there is an existing method or library that implements a more effective sequence alignment algorithm than I can put together with regex and edit distance, but I can't find one.
Does anybody know of a highly accurate method for comparing strings to achieve this outcome?