
I have a large corpus of text, split into sentences. I have two versions of each sentence; one version has POS-tagged tokens. I want to POS-tag everything in version 1 by replacing the words in version 1 with their POS-tagged counterparts from version 2.

There are some complications with this:

  1. Spelling for the same word can be different between the two versions (e.g. 'cafe' vs. 'café').

  2. Spacing in the POS-tagged version doesn't always match spacing in the other (e.g. "did", "n't" vs. "didn't").

  3. One version uses symbols regularly while the other spells out the full word (e.g. '&' vs. 'and').

The language of the text isn't English, so the examples above are only a rough approximation of what's going on. Here are a couple of examples from the actual text. I hope it's easy to see how POS-tagged text in version 2 matches the text in version 1 closely, but not exactly; some words are missing, some are spelled differently, some are out of order, etc.

Example 1.
Version 1: ".o. omi adov-ztu jo znóyod sotfico pru & bra"
Version 2: [['omi', '<DET>'], ['adov', '<NOUN>'], ['ztu', '<PRON>'], ['znóyod', '<VERB>'],
           ['sotfico', '<ADJ>'], ['uont', '<CCONJ>'], ['jo', '<ADP>']]

Example 2.
Version 1: "vomoyj zíy"
Version 2: [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]

Example 3.
Version 1: ".o. fa-tistyjogot"
Version 2: [['fa', '<PP>'], ['t', '<IP>'], ['is', '<UU>'], ['fatistyjogot', '<VERB>']] 

In example 1, '&' maps to 'uont'. The words 'pru' and 'bra' in version 1 don't map to anything in version 2. The word 'jo' is also in the wrong place in version 2 and needs to follow the word order of version 1.

In example 2, 'vó', 'ṁo', and 'yj' all map to 'vomoyj', even though some characters differ and the word is split in two places.

In example 3 there is only one word, but parts of it are repeated. 'fa', 't', and 'is' all appear in 'fatistyjogot', so I can ignore everything except 'fatistyjogot' in version 2.

Where a word is tagged in version 2, I want to replace its counterpart in version 1 with the form from version 2 and the POS-tag. That way I can keep the word order of version 1. If no tagged form exists in version 2, I want to keep the word from version 1 and add the placeholder tag, '<X>'. I also need to leave out any content in version 2 if it is repeated like in example 3. So, from the examples above, I'd like to create the following lists:

Example 1: [['.o.', '<X>'], ['omi', '<DET>'], ['adov', '<NOUN>'], ['ztu', '<PRON>'], ['jo', '<ADP>'],
           ['znóyod', '<VERB>'], ['sotfico', '<ADJ>'], ['pru', '<X>'], ['uont', '<CCONJ>'], ['bra', '<X>']]
Example 2: [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]
Example 3: [['.o.', '<X>'], ['fatistyjogot', '<VERB>']]
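For clarity about the part I can already handle: once I know which v1 positions have a tagged counterpart in v2, building those lists is straightforward. A rough sketch (the names here are made up):

```python
def merge(v1_words, matches):
    """Build the output list, keeping v1's word order.

    matches maps a v1 word index to its (form, tag) pair from version 2;
    unmatched v1 words keep their own form and get the placeholder tag.
    """
    merged = []
    for i, word in enumerate(v1_words):
        if i in matches:
            form, tag = matches[i]
            merged.append([form, tag])
        else:
            merged.append([word, '<X>'])
    return merged
```

So the hard part is only producing that index-to-token mapping.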

I've tried writing a function using RegEx and the edit distance method from the nltk module to identify similar strings. It works well for longer strings, but because some strings are so short, like 'vó' above, it sometimes has difficulties. I've also looked at sequence alignment libraries, but found myself confused trying to apply them.
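For illustration, my similarity-based attempt boiled down to something like this (sketched here with the standard library's difflib rather than nltk's edit distance; the function name is made up):

```python
from difflib import SequenceMatcher

def best_match(word, candidates):
    # Return the candidate most similar to `word` by difflib's ratio.
    # For very short strings a single differing character drags the
    # score down sharply, which is exactly where this approach fails.
    return max(candidates, key=lambda c: SequenceMatcher(None, word, c).ratio())
```

This reliably picks the right candidate for longer tokens like 'znóyod', but for two-character tokens like 'vó' the scores are too close together to trust.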

Is there any way to compare these strings and match every string in version 2 to some substring in version 1 with high accuracy? I can sort out the POS tags myself, I just need a way to find all of the corresponding tokens.

For example, can I write a function, give it the two versions as arguments, and get it to return all the related strings (and their index/placement in the sentence)?

v1 = "vomoyj zíy"
v2 = [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]

def some_func(v1, v2):
    # ...do something...
    return comparison_list

print(some_func(v1, v2))

Output:
[['vó', 'vomoyj', 0], ['ṁo', 'vomoyj', 1], ['yj', 'vomoyj', 2], ['zíy', 'zíy', 3]]
*OR*
[['vó', 'vo'], ['ṁo', 'mo'], ['yj', 'yj'], ['zíy', 'zíy']]

EDIT: It's not feasible to translate this to English to simplify the problem. I really need to just compare strings.

AdeDoyle
  • If you already have the POS tags, would it make sense to convert every token to an English token and then compare the two versions? – qaiser Feb 12 '20 at 06:26
  • The language is highly under-resourced, and manually converting every token to English isn't feasible. English also isn't a good candidate for translating to, as it's morphologically quite simple, and couldn't capture the morphological distinctions between word forms here. – AdeDoyle Feb 13 '20 at 16:56

1 Answer


You can convert each token into an English (ASCII) token and then use that to find the similar token, and its position, in the string (here, v1):

v1 = 'vomoyj ziy'
v2 = [['vó', '<SCONJ>'], ['ṁo', '<PART>'], ['yj', '<PRON>'], ['zíy', '<ADJ>']]

import unidecode

def comparison_func(v1, v2):
    output_ = []
    for token in v2:
        # strip diacritics: 'vó' -> 'vo', 'ṁo' -> 'mo', 'zíy' -> 'ziy'
        converted_token = unidecode.unidecode(token[0])
        # str.find returns -1 if the converted token isn't a substring of v1
        position = v1.find(converted_token)
        output_.append([token[0], v1[position:position + len(converted_token)], position])
    return output_

comparison_func(v1, v2)
#op
[['vó', 'vo', 0], ['ṁo', 'mo', 2], ['yj', 'yj', 4], ['zíy', 'ziy', 7]]
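If adding a third-party dependency is a concern, the same diacritic stripping can be done with the standard library's unicodedata. This variant (a sketch; the helper names are mine) also strips accents from v1 itself, so the original accented sentence from the question can be passed in unchanged:

```python
import unicodedata

def strip_accents(s):
    # Decompose each character, then drop the combining marks,
    # so 'vó' -> 'vo', 'ṁo' -> 'mo', 'zíy' -> 'ziy'
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

def compare_tokens(v1, v2):
    # Note: slicing v1 by indices found in the stripped copy assumes each
    # accented character in v1 is a single codepoint (i.e. v1 is in NFC)
    plain_v1 = strip_accents(v1)
    output = []
    for token, tag in v2:
        plain = strip_accents(token)
        position = plain_v1.find(plain)
        if position == -1:
            output.append([token, None, -1])  # no substring match found
        else:
            output.append([token, v1[position:position + len(plain)], position])
    return output
```

With the question's original v1 = 'vomoyj zíy', this returns [['vó', 'vo', 0], ['ṁo', 'mo', 2], ['yj', 'yj', 4], ['zíy', 'zíy', 7]].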
qaiser
  • For complication no. 2, you can use this file: https://github.com/dipanjanS/text-analytics-with-python/blob/master/Old-First-Edition/source_code/Ch03_Processing_and_Understanding_Text/contractions.py – qaiser Feb 12 '20 at 11:10
  • Thanks for the suggestion. I should have probably clarified in the post, so I've edited it now; it's not possible to convert this to English, at least, automatically. Doing so manually is unfeasible. – AdeDoyle Feb 13 '20 at 20:54