4

Is there an easy way to determine the most likely part-of-speech tag for a given word without context, using NLTK? Or, failing that, using any other tool / dataset?

I tried to use WordNet, but it seems that the synsets are not ordered by likelihood:

>>> wn.synsets('says')

[Synset('say.n.01'), Synset('state.v.01'), ...]

1 Answer

6

If you want to try tagging without context, you are looking for some sort of unigram tagger, a.k.a. a lookup tagger. A unigram tagger tags a word solely based on the frequency of each tag given that word, so it avoids context heuristics. However, for any tagging task you need data, and to train a unigram tagger you need annotated data. See the lookup tagger in the NLTK tutorial: http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html.
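The lookup idea can be sketched in plain Python: count how often each tag occurs with each word in an annotated corpus, then tag a word by looking up its most frequent tag. The tiny corpus below is hand-made for illustration; in practice you would use real annotated data such as Brown:

```python
from collections import Counter, defaultdict

# Tiny hand-made annotated corpus (hypothetical data, for illustration only).
tagged_corpus = [
    [('this', 'DT'), ('is', 'VBZ'), ('a', 'AT'), ('sentence', 'NN'), ('.', '.')],
    [('he', 'PPS'), ('says', 'VBZ'), ('this', 'DT'), ('.', '.')],
    [('she', 'PPS'), ('says', 'VBZ'), ('a', 'AT'), ('word', 'NN'), ('.', '.')],
]

# Count tag frequencies per word.
tag_counts = defaultdict(Counter)
for sent in tagged_corpus:
    for word, tag in sent:
        tag_counts[word][tag] += 1

def most_likely_tag(word):
    """Return the most frequent tag for a word, or None if unseen."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_tag('says'))  # 'VBZ' in this toy corpus
print(most_likely_tag('foo'))   # None: unseen word
```

This is exactly what NLTK's `UnigramTagger` does internally, which is why unseen words come back as `None`.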

Below is another way of training and testing a unigram tagger in NLTK:

>>> from nltk.corpus import brown
>>> from nltk import UnigramTagger as ut
>>> brown_sents = brown.tagged_sents()
# Split the data into train and test sets.
>>> train = int(len(brown_sents) * 90 / 100) # use 90% for training
# Train the tagger.
>>> uni_tag = ut(brown_sents[:train]) # this will take some time, ~1-2 mins
# Tag a sample sentence.
>>> uni_tag.tag("this is a foo bar sentence .".split())
[('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('foo', None), ('bar', 'NN'), ('sentence', 'NN'), ('.', '.')]
# Test the tagger's accuracy on the held-out 10%.
>>> uni_tag.evaluate(brown_sents[train+1:]) # will also take ~1-2 mins
0.8851469586629643

I wouldn't recommend using WordNet for POS tagging, because there are so many words that still have no entry in WordNet. But you can take a look at using lemma frequencies in WordNet; see How to get the wordnet sense frequency of a synset in NLTK?. These frequencies are based on the SemCor corpus (http://www.cse.unt.edu/~rada/downloads.html).
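As a sketch of that approach: WordNet lemmas carry SemCor-derived sense counts (`lemma.count()` in NLTK), and you could sum those counts per part of speech and pick the POS with the highest total. The aggregation logic looks like this, over hypothetical `(pos, count)` pairs standing in for the real lemma counts of a word like 'says':

```python
from collections import Counter

# Hypothetical (pos, count) pairs standing in for WordNet lemma.count()
# values across the synsets of one word ('n' = noun, 'v' = verb).
sense_counts = [('n', 1), ('v', 148), ('v', 24), ('n', 0), ('v', 3)]

def most_likely_pos(pairs):
    """Sum sense counts per POS and return the POS with the highest total."""
    totals = Counter()
    for pos, count in pairs:
        totals[pos] += count
    return totals.most_common(1)[0][0]

print(most_likely_pos(sense_counts))  # 'v': verb senses dominate here
```

Note that many sense counts in SemCor are zero, so ties are common and coverage is still limited to words WordNet knows about.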

    "A unigram tagger tags a word solely based on the frequency of the tag given a word." Ideally it would also look at the word endings, for smoothing. – Maarten Sep 25 '13 at 12:32
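The smoothing suggestion in the comment can be sketched as a suffix-based fallback: when a word was never seen in training, guess its tag from the most frequent tag among training words that share its last few characters (NLTK's `AffixTagger` implements this idea, usually chained as a backoff). A minimal pure-Python illustration with hypothetical data:

```python
from collections import Counter, defaultdict

# Hypothetical training words with tags; '-ing' words are mostly VBG here.
training = [('running', 'VBG'), ('eating', 'VBG'), ('sing', 'VB'),
            ('talking', 'VBG'), ('cat', 'NN'), ('mat', 'NN')]

SUFFIX_LEN = 3

# Count tags per whole word and per word-final suffix.
word_tags = defaultdict(Counter)
suffix_tags = defaultdict(Counter)
for word, tag in training:
    word_tags[word][tag] += 1
    suffix_tags[word[-SUFFIX_LEN:]][tag] += 1

def tag_with_backoff(word):
    """Look the word up directly; fall back to its suffix; else None."""
    if word in word_tags:
        return word_tags[word].most_common(1)[0][0]
    suffix = word[-SUFFIX_LEN:]
    if suffix in suffix_tags:
        return suffix_tags[suffix].most_common(1)[0][0]
    return None

print(tag_with_backoff('jumping'))  # 'VBG' via the '-ing' suffix
print(tag_with_backoff('cat'))      # 'NN' via direct lookup
```

This is why the unseen word 'foo' in the answer above came back as `None`: a plain unigram tagger has no such fallback unless you give it a backoff tagger.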