I have two lists, both produced by part-of-speech taggers, which look as follows:

pos_tags = [('This', u'DT'), ('is', u'VBZ'), ('a', u'DT'), ('test', u'NN'), ('sentence', u'NN'), ('.', u'.'), ('My', u"''"), ('name', u'NN'), ('is', u'VBZ'), ('John', u'NNP'), ('Murphy', u'NNP'), ('and', u'CC'), ('I', u'PRP'), ('live', u'VBP'), ('happily', u'RB'), ('on', u'IN'), ('Planet', u'JJ'), ('Earth', u'JJ'), ('!', u'.')]


pos_names = [('John', 'NNP'), ('Murphy', 'NNP')]

I want to create a final list which updates pos_tags with the list items in pos_names. So basically I need to find John and Murphy in pos_tags and replace the POS tag with NNP.

Markus
  • To what does `[('Planet', u'JJ'), ('Earth', u'JJ')]` belong? – Joschua Dec 17 '14 at 14:31
  • Have you tried anything so far? – David Reeve Dec 17 '14 at 14:32
  • That was a copy and paste error which has now been rectified in the original post. – Markus Dec 17 '14 at 14:33
  • John and Murphy are already associated with NNP in your `pos_tags` list. Can you provide another example? Do you want to change the pos tag if a new one is seen? – xnx Dec 17 '14 at 14:34
  • I have tried some nested loops which didn't work. I am more a linguist than a programmer so this is all a bit overwhelming. – Markus Dec 17 '14 at 14:35
  • This is just a coincidence. To provide more background, the first list originates from a classifier-based POS tagger which often fails to identify names. The second list is generated by a tagger that aims at tagging names as NNP. So if I replace John with Markus, the first list will show `('Markus', u'RB')`, which I would like to replace with `('Markus', u'NNP')` if it is present in the `pos_names` list. – Markus Dec 17 '14 at 14:39

3 Answers

You could create a dictionary from pos_names that behaves as a lookup table. Then you can use get to search the table for possible replacements, and leave the tag as-is if no replacement is found.

d = dict(pos_names)
pos_tags = [(word, d.get(word, tag)) for word, tag in pos_tags]
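
As a minimal sketch of how this behaves, here is the case described in the comments, where the first tagger mislabels the name 'Markus' as RB. The sample sentence and the variable name `updated` are made up for illustration:

```python
# Hypothetical sample data: the first tagger mislabels the name 'Markus'.
pos_tags = [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Markus', 'RB'), ('.', '.')]
pos_names = [('Markus', 'NNP')]

# Build the lookup table: word -> corrected tag.
d = dict(pos_names)

# d.get(word, tag) returns the corrected tag when the word is in the table
# and falls back to the original tag otherwise. List order is preserved.
updated = [(word, d.get(word, tag)) for word, tag in pos_tags]
print(updated)
```

Note that the match is on the exact word string, so it is case-sensitive and every occurrence of a listed word is retagged.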
Kevin

Given

pos_tags = [('This', u'DT'), ('is', u'VBZ'), ('a', u'DT'), ('test', u'NN'), ('sentence', u'NN'), ('.', u'.'), ('My', u"''"), ('name', u'NN'), ('is', u'VBZ'), ('John', u'NNP'), ('Murphy', u'NNP'), ('and', u'CC'), ('I', u'PRP'), ('live', u'VBP'), ('happily', u'RB'), ('on', u'IN'), ('Planet', u'JJ'), ('Earth', u'JJ'), ('!', u'.')]

and

names = ['John', 'Murphy']

you can do:

[next((subl for subl in pos_tags if name in subl)) for name in names]

which will give you:

[('John', u'NNP'), ('Murphy', u'NNP')]
tdc

I agree that a dictionary would be a more natural solution to this problem, but if you need to keep your pos_tags in order, a more explicit solution would be:

for word, pos in pos_names:
    for i, (tagged_word, tagged_pos) in enumerate(pos_tags):
        if word == tagged_word:
            pos_tags[i] = (word,pos)

(A dictionary would probably be faster for a large number of words, so you might want to consider storing the word order in a list and doing your POS allocation using a dictionary.)
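
One possible reading of that suggestion, as a rough sketch rather than a definitive implementation: keep the words in a list for ordering and hold the tags in a dictionary. The names `words` and `tag_of` and the sample data are made up for this illustration. It assumes each word carries a single tag throughout the text; a word that occurs twice with different tags would collapse to whichever tag was seen last.

```python
# Hypothetical sample data for illustration.
pos_tags = [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Markus', 'RB'), ('.', '.')]
pos_names = [('Markus', 'NNP')]

# Store the word order separately from the tags.
words = [word for word, _ in pos_tags]

# Assumption: one tag per distinct word, so a plain dict is enough.
tag_of = dict(pos_tags)

# Overwrite tags for the names; dict.update accepts (key, value) pairs.
tag_of.update(pos_names)

# Rebuild the tagged list in the original order.
pos_tags = [(word, tag_of[word]) for word in words]
print(pos_tags)
```

If the same word can legitimately have different tags at different positions, the nested-loop version or a lookup built only from `pos_names` avoids that collapse.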

xnx