
Is there a way to use NLTK to get a set of possible parts of speech of a single string of letters, taking into account that different words might have homonyms?

For example: report -> {Noun, Verb} , kind -> {Adjective, Noun}

I have not been able to find a POS tagger that assigns parts of speech to words outside the context of a full sentence. This seems like a very basic request of NLTK, so I'm confused as to why I've had so much trouble finding it.

Leland Reardon

2 Answers


Yes. The simplest way is not to use a tagger, but simply load up one or more corpora and collect the set of all tags for the word you are interested in. If you're interested in more than one word, it's simplest to collect the tags for all words in the corpus, then look up anything you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset:

>>> import nltk
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
...         for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']
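
If you only want the set of tags rather than the counts, a small wrapper over the distribution above will do (the helper name is just for illustration):

>>> def possible_pos(word):
...     """Set of universal POS tags observed for `word` in the Brown corpus."""
...     return set(wordtags[word.lower()])
... 
>>> sorted(possible_pos("kind"))
['ADJ', 'NOUN']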
alexis

POS tagging models are trained on sentence/document data, so the expected input to a pre-trained model is a sentence or document. When you pass in a single word, the tagger treats it as a one-word sentence, so you get exactly one tag for that single-word context.
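
For instance, tagging a lone word yields just one tag in isolation (the exact tag depends on the pre-trained tagger shipped with your NLTK version):

>>> from nltk import pos_tag
>>> pos_tag(['report'])
[('report', 'NN')]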

If you're trying to find all possible POS tags per English word, you need a corpus containing many different uses of the words, tag that corpus, and then count/extract the number of tags per word. E.g.

>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]

>>> from itertools import chain
>>> from collections import defaultdict, Counter
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]

>>> for word, pos in chain(*tagged_sents):
...     counts[word][pos] += 1
... 

>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})

>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})

Alternatively, there's WordNet:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})

But note that WordNet is a manually crafted resource, so you cannot expect every English word to be in it.
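
Putting that together, here's a small sketch of a WordNet-based lookup (the helper name is illustrative; an empty result simply means WordNet has no entry for the word):

>>> def wordnet_pos(word):
...     """Set of WordNet POS letters for `word`; empty if WordNet has no entry."""
...     return {ss.pos() for ss in wn.synsets(word)}
... 
>>> wordnet_pos('coaches') == {'n', 'v'}
True
>>> len(wordnet_pos('asdfgh'))
0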

alvas