
I have two texts, text A and text B. Text B isn't an exact copy of text A: it contains a lot of special characters which aren't in text A, but it is technically the same text. I need to compare the strings and map the words in text B to their counterparts in text A.

The text isn't in English, and can't easily be translated to English, so the examples below are just to demonstrate a few of the problems.

Some words in text A are not in text B, but all words in text B should be in text A:

text_a = "he experienced déjà vu"
text_b = ['he', 'experienced']

Some words in text B use different characters from text A, but are the same words:

text_a = "she owns & runs the cafe florae"
text_b = ['she', 'owns', 'and', 'runs', 'the', 'cefé', 'floræ']

Words in text B are generally in the right order, but not always:

text_a = "an uneasy alliance"
text_b = ['uneasy', 'alliance', 'an']

Some words in text B are made up of smaller components, which are also included in text B, and these smaller components are unnecessary:

text_a = "we should withdraw our claim"
text_b = ['we', 'should', 'with', 'draw', 'withdraw', 'our', 'claim']

Some words in text A are represented by two or more words in text B:

text_a = "they undercut their competitors"
text_b = ['they', 'under', 'cut', 'their', 'competitors']

What I want to do is replace the words in text A with their counterparts from text B. To do this I need to write a function to match words between the two texts.

I've tried writing a function that compares strings using the edit distance method from the nltk library, in conjunction with a handful of regexes. This only does an OK job, so I looked at sequence alignment techniques from libraries like biopython, but I can't get my head around these.

In particular, with edit distance it's very difficult to match words like 'under' and 'cut' to 'undercut' while also avoiding errors on short strings. In a sentence containing similar tokens, like 'to' and 'tu', both tokens have the same edit distance from something like 'tú' and would theoretically be equally valid candidates, even though the obvious match here is 'tu', not 'to'.
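
To illustrate the tie, using nltk's built-in edit_distance:

from nltk import edit_distance

# both candidates are one substitution away from 'tú',
# so plain edit distance can't choose between them
print(edit_distance('to', 'tú'))  # 1
print(edit_distance('tu', 'tú'))  # 1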

Is there any highly accurate way to match strings from text B in text A? I'd like to get an output like:

text_a = "the cafe florae undercut their competitors then withdrew their claim"
text_b = ['the', 'café', 'floræ', 'under', 'cut', 'their', 'competitors', 'then',
          'with', 'drew', 'withdrew', 'their', 'claim']

match_list = some_matchfunc(text_a, text_b)

print(match_list)

[['the', 'the'], ['cafe', 'café'], ['florae', 'floræ'], ['undercut', 'under'],
 ['undercut', 'cut'], ['their', 'their'], ['competitors', 'competitors'], ['then', 'then'],
 ['withdrew', 'withdrew'], ['their', 'their'], ['claim', 'claim']]

Ideally, this would also include the index of the beginning and end of each matched word in text A, to avoid confusion with repeated words like "their", which occurs twice below:

[['the', [0, 3], 'the'], ['cafe', [4, 8], 'café'], ['florae', [9, 15], 'floræ'],
 ['undercut', [16, 24], 'under'], ['undercut', [16, 24], 'cut'], ['their', [25, 30], 'their'],
 ['competitors', [31, 42], 'competitors'], ['then', [43, 47], 'then'], ['withdrew', [48, 56], 'withdrew'],
 ['their', [57, 62], 'their'], ['claim', [63, 68], 'claim']]
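
Producing the index pairs themselves is straightforward, e.g. by scanning text A once with the standard library:

import re

text_a = "the cafe florae undercut their competitors then withdrew their claim"

# spans[i] = [start, end] of the i-th whitespace-delimited word in text A
spans = [[m.start(), m.end()] for m in re.finditer(r'\S+', text_a)]
print(spans[0], spans[7])  # [0, 3] [48, 56]

It's the matching itself that I'm stuck on.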

As mentioned above, the text isn't in English, and translating it to compare words using NLP techniques isn't really feasible, so it does need to be based on string comparison. I think there must already be some method or library which employs a more efficient sequence alignment algorithm than I can come up with using regexes and edit distance, but I can't find one.

Does anybody know of a highly accurate method for comparing strings to achieve this outcome?

AdeDoyle
  • "What I want to do is replace the words in text A" this doesn't match the "expected output" you've shown, and is also confusing as your second example shows duplication in text B, while "replace" implies a one-to-one correspondence. – Hymns For Disco Feb 15 '20 at 14:58
  • "Does anybody know of a highly accurate method ... to achieve this outcome". Not unless you can rigorously define the relationship between text A and text B. It seems that you have just found yourself with an unwieldy data set and are seeking an intuitive outcome – Hymns For Disco Feb 15 '20 at 15:06
  • Admittedly, it may just be unwieldy data. Regarding your first comment, though, I would be able to use the "expected output" to replace the words in text A myself. I'm interested in having the list as the solution here, though, as it would allow me to quickly check matches manually before replacing anything in text A. – AdeDoyle Feb 15 '20 at 15:14
  • it may be beneficial to try to understand the concept of "similarity" from a linguistics perspective first before trying to implement an algorithm. For example, in English it is quite common and acceptable to substitute an accented character for an unaccented one without changing meaning, but this is not the case in every language. Perhaps you could ask for some further insight about the historical language you are working with at https://linguistics.stackexchange.com/ – Hymns For Disco Feb 15 '20 at 15:32
  • I don't think it's necessary to look for help with the language itself, as I'm quite familiar with it. You're right though, just as in English, it's common for accented and unaccented characters to be used interchangeably without changing the meaning in this language (as well as a handful of other characters, such as ṁ/m and ṅ/n). So I think an alternate form of edit distance would be preferable, where accented/unaccented forms of the same character are weighted less than entirely different characters. – AdeDoyle Feb 15 '20 at 17:20
  • you can implement a form of normalization, where you replace every character with its plain equivalent (ë->e, æ->ae), and maybe, if certain letters are often used interchangeably for the same sounds, something like (k->c, ph->f). Then, when comparing words, always compare the normalized versions. – Hymns For Disco Feb 15 '20 at 21:49
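
A minimal sketch of the normalization suggested in the last comment (the character table here is only a toy example; a real one would reflect the language's actual orthography):

import unicodedata

# example-only equivalences; a real table would come from knowledge
# of the language's interchangeable characters
MULTI = {'æ': 'ae', '&': 'and'}

def normalize(word):
    # expand the hand-made equivalences first
    for src, dst in MULTI.items():
        word = word.replace(src, dst)
    # then decompose accented characters (NFD) and drop the combining marks
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(normalize('café'), normalize('floræ'))  # cafe florae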

1 Answer


The problem itself is very complex, and I would suggest a combination of dictionaries of suitable synonyms where possible, falling back to a sequence alignment approach. The implementations in biopython are probably not really suitable for this case (BLAST, for instance, relies on a score matrix that only makes sense for nucleotide or amino acid sequences, not for real words). I suggest you have a look at SequenceMatcher, which could do the job. A very simple (albeit naive) solution is to do a pairwise alignment against all the candidates and pick the closest match, weighting candidates by the complexity of the alignment, e.g. whether gaps/replacements are required (imagine "they're" -> "they are").
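
A naive version of that pairwise approach might look like this, using only the standard library (the 0.6 cutoff is an arbitrary placeholder, not a recommended value):

from difflib import SequenceMatcher

def best_match(word, candidates, cutoff=0.6):
    # score 'word' against every candidate and keep the closest one,
    # provided it clears the (arbitrary) similarity cutoff
    scored = [(SequenceMatcher(None, word, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

print(best_match('café', ['the', 'cafe', 'florae']))  # cafe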

Bear in mind that in some cases many-to-many, one-to-many, and many-to-one substitutions will be required (you have some of these in your example already). This is not automatically solved by sequence alignment, hence my suggestion to use a dictionary (a bidirectional one, if you can afford it). If the synonym corpus is very big, I would even consider a database for such tasks.
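
For the substitution table itself, even a plain dict plus its derived inverse goes a long way; the entries below are invented purely for illustration:

# toy many-to-one substitution table (invented entries)
MANY_TO_ONE = {('under', 'cut'): 'undercut', ('with', 'drew'): 'withdrew'}

# derive the reverse direction automatically to keep it 'bidirectional'
ONE_TO_MANY = {whole: parts for parts, whole in MANY_TO_ONE.items()}

print(ONE_TO_MANY['undercut'])  # ('under', 'cut')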

Also, some of your examples require word-level substitutions and some require letter-level substitutions. I suggest that you handle these separately. If you don't have to deal with typos, I would start with the bigger (word) scale, and then move on to the letter-level substitutions.
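
A rough sketch of that two-stage idea, with exact word-level matches tried first and a letter-level similarity fallback second (again standard library only; the vocabulary and cutoff are arbitrary examples):

from difflib import get_close_matches

def match_word(word, vocabulary):
    # stage 1: word level, exact hit in the vocabulary
    if word in vocabulary:
        return word
    # stage 2: letter level, closest spelling among the remaining words
    close = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return close[0] if close else None

print(match_word('their', ['their', 'then']))    # their  (stage 1)
print(match_word('floræ', ['florae', 'claim']))  # florae (stage 2)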

Mateo Torres
  • I'm working with a historical language stage which doesn't have any specific dictionary or thesaurus. There is a dictionary for the modern descendants of the language, but these are grammatically very different, and words' spellings have changed drastically over time. Even within the text I'm using, the spelling of one word can vary considerably (hence the matching issues). I'll take a look at `SequenceMatcher`. – AdeDoyle Feb 15 '20 at 15:20
  • To me, this looks like a situation in which you'll need manual curation almost unavoidably. The dictionary I am referring to is indeed similar to a thesaurus, but "simpler" in the sense that it will be only for word substitutions and spelling variants. I think such a resource could be a community effort, but I don't think an automated solution exists for any language. If you find something like this, please update this question. I am interested in building language models for Guarani :) – Mateo Torres Feb 15 '20 at 15:29