I have a large corpus of text, split into sentences. I have two versions of each sentence, one version has POS-tagged tokens. I want to POS tag everything in version 1. I want to do this by replacing the words in version 1 with their POS-tagged counterparts from version 2.
There are some complications with this (a rough normalisation sketch for them follows below):
- Spelling for the same word can differ between the two versions (e.g. 'cafe' vs. 'café').
- Spacing in the POS-tagged version doesn't always match spacing in the other (e.g. "did", "n't" vs. "didn't").
- One version regularly uses symbols where the other spells out the full word (e.g. '&' vs. 'and').
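To make those differences concrete, this is roughly the kind of per-token normalisation I imagine has to happen before any comparison; the symbol map and accent stripping below are only placeholders, since the real text isn't English:

import unicodedata

# Placeholder normalisation for the three issues above. The symbol map is
# purely illustrative; the real mapping depends on the language.
SYMBOL_MAP = {'&': 'and'}

def normalise(token):
    token = SYMBOL_MAP.get(token, token)
    # strip diacritics so that e.g. 'café' and 'cafe' compare equal
    decomposed = unicodedata.normalize('NFD', token)
    return ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn').lower()

print(normalise('café'))  # cafe
print(normalise('&'))     # and

The spacing issue ("did", "n't" vs. "didn't") is the part that per-token normalisation can't fix, which is why I need some way of matching tokens between the two versions.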
The language of the text isn't English, so the examples above are only a rough approximation of what's going on. Here are a few examples from the actual text. I hope it's easy to see how the POS-tagged text in version 2 matches the text in version 1 closely, but not exactly: some words are missing, some are spelled differently, some are out of order, etc.
Example 1.
Version 1: ".o. omi adov-ztu jo znóyod sotfico pru & bra"
Version 2: [['omi', '<DET>'], ['adov', '<NOUN>'], ['ztu', '<PRON>'], ['znóyod', '<VERB>'],
['sotfico', '<ADJ>'], ['uont', '<CCONJ>'], ['jo', '<ADP>']]
Example 2.
Version 1: "vomoyj zíy"
Version 2: [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]
Example 3.
Version 1: ".o. fa-tistyjogot"
Version 2: [['fa', '<PP>'], ['t', '<IP>'], ['is', '<UU>'], ['fatistyjogot', '<VERB>']]
In example 1, '&' maps to 'uont'. The words 'pru' and 'bra' in version 1 don't map to anything in version 2. The word 'jo' is also in the wrong place in version 2 and needs to follow the word order of version 1.
In example 2, 'vó', 'ṁo', and 'yj' all map to 'vomoyj', even though some characters are different and the word is split in two places.
In example 3 there is only one word, but parts of it are repeated: 'fa', 't', and 'is' all appear in 'fatistyjogot', so I can ignore everything except 'fatistyjogot' in version 2.
Where a word is tagged in version 2, I want to replace its counterpart in version 1 with the form from version 2 plus the POS tag; that way I keep the word order of version 1. If no tagged form exists in version 2, I want to keep the word from version 1 and add the placeholder tag '<X>'. I also need to leave out any content in version 2 that is repeated, as in example 3. So, from the examples above, I'd like to create the following lists (a small sketch of this replacement step follows after them):
Example 1: [['.o.', '<X>'], ['omi', '<DET>'], ['adov', '<NOUN>'], ['ztu', '<PRON>'], ['jo', '<ADP>'],
['znóyod', '<VERB>'], ['sotfico', '<ADJ>'], ['pru', '<X>'], ['uont', '<CCONJ>'], ['bra', '<X>']]
Example 2: [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]
Example 3: [['.o.', '<X>'], ['fatistyjogot', '<VERB>']]
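To show which part I can already handle: once I have an alignment from version-1 word positions to their tagged pieces in version 2, building these lists is straightforward. The hand-written alignment dict below is exactly the thing I don't know how to produce automatically (apply_tags is just my own placeholder name).

def apply_tags(v1_tokens, alignment):
    # alignment: dict mapping a version-1 word index to the list of
    # [form, tag] pairs it corresponds to in version 2
    out = []
    for i, word in enumerate(v1_tokens):
        if i in alignment:
            out.extend(alignment[i])   # take spelling and tag from version 2
        else:
            out.append([word, '<X>'])  # no tagged counterpart: keep the v1 word
    return out

# Example 2, with the alignment written by hand:
alignment = {0: [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>']],
             1: [['zíy', '<ADJ>']]}
print(apply_tags("vomoyj zíy".split(), alignment))
# [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]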
I've tried writing a function using RegEx and the edit distance method from the nltk module to identify similar strings. It works well for longer strings, but because some strings are so short, like 'vó' above, it sometimes has difficulties. I've also looked at sequence alignment libraries, but found myself confused trying to apply them.
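For reference, this is roughly the kind of edit-distance matching I tried (simplified, and leaving out the RegEx part); the length normalisation and the 0.5 cut-off are arbitrary choices on my part, and you can see it give up on the short token:

import nltk

def best_match(v2_token, v1_words, max_ratio=0.5):
    # normalise edit distance by the longer string and treat anything
    # above max_ratio (an arbitrary threshold) as "no match"
    best, best_ratio = None, max_ratio
    for i, word in enumerate(v1_words):
        dist = nltk.edit_distance(v2_token.lower(), word.lower())
        ratio = dist / max(len(v2_token), len(word))
        if ratio <= best_ratio:
            best, best_ratio = i, ratio
    return best  # index into v1_words, or None

v1_words = "vomoyj zíy".split()
print(best_match('zíy', v1_words))  # 1 -- fine for longer tokens
print(best_match('vó', v1_words))   # None -- short tokens fall through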
Is there any way to compare these strings and match every string in version 2 to some substring in version 1 with high accuracy? I can sort out the POS tags myself; I just need a way to find all of the corresponding tokens.
For example, could I write a function, pass it the two versions as arguments, and have it return all the related strings (and their index/placement in the sentence)?
v1 = "vomoyj zíy"
v2 = [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]
def some_func(v1, v2):
*do something*
return comparison_list
print(some_func(v1, v2))
Output:
[['vó', 'vomoyj', 0], ['ṁo', 'vomoyj', 1], ['yj', 'vomoyj', 2], ['zíy', 'zíy', 3]]
*OR*
[['vó', 'vo'], ['ṁo', 'mo'], ['yj', 'yj'], ['zíy', 'zíy']]
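The closest I've come is the character-level approach below, using difflib.SequenceMatcher from the standard library: find each version-2 token's longest matching block inside the accent-stripped version-1 string and map that position back to a word. The normalisation is guesswork, the third element of each result is the version-1 word index rather than the placement shown above, and I doubt it handles cases like '&' vs. 'uont', which is why I'm asking whether there's a more reliable way.

import difflib
import unicodedata

def strip_accents(s):
    # crude normalisation so that e.g. 'vó' and 'vo' compare equal;
    # almost certainly needs refining for the actual language
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def some_func(v1, v2):
    # record the character span of each version-1 word, so a character
    # position in v1 can be mapped back to a word index
    words, spans, pos = v1.split(), [], 0
    for word in words:
        start = v1.index(word, pos)
        spans.append((start, start + len(word)))
        pos = start + len(word)

    # assumes v1 uses precomposed characters, so stripping accents
    # doesn't shift character positions
    norm_v1 = strip_accents(v1).lower()

    comparison_list = []
    for token, _tag in v2:
        norm_tok = strip_accents(token).lower()
        sm = difflib.SequenceMatcher(None, norm_tok, norm_v1)
        m = sm.find_longest_match(0, len(norm_tok), 0, len(norm_v1))
        # which version-1 word does the matched position fall inside?
        hit = next((i for i, (s, e) in enumerate(spans) if s <= m.b < e), None)
        if m.size == 0 or hit is None:
            comparison_list.append([token, None, None])
        else:
            comparison_list.append([token, words[hit], hit])
    return comparison_list

print(some_func("vomoyj zíy",
                [['vó', '<SCONJ>'], ['ṁo', '<PART>'],
                 ['yj', '<PRON>'], ['zíy', '<ADJ>']]))
# [['vó', 'vomoyj', 0], ['ṁo', 'vomoyj', 0], ['yj', 'vomoyj', 0], ['zíy', 'zíy', 1]]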
EDIT: It's not feasible to translate this to English to simplify the problem. I really need to just compare strings.