5

In the following code, why does nltk think 'fish' is an adjective and not a noun?

>>> import nltk
>>> s = "a woman needs a man like a fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]
waigani
  • see http://stackoverflow.com/questions/30821188/python-ntlk-pos-tag-not-returnig-the-correct-pos – alvas Jun 13 '15 at 22:35

5 Answers

4

I am not sure what the workaround is, but you can check the source here: https://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/

Meanwhile, I tried your sentence with a slightly different approach.

>>> s = "a woman needs a man. A fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man.', 'NP'), ('A', 'NNP'), ('fish', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('bicycle', 'NN')]

This resulted in fish being tagged as "NN".

Chandan Gupta
4

If you first used a lookup tagger as described in chapter 5 of the NLTK book (for example, using WordNet as the lookup reference), your tagger would already "know" that fish cannot be an adjective. For all words with several possible POS tags, you could then use a statistical tagger as a backoff tagger.
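
To make the lookup-then-backoff idea concrete, here is a minimal pure-Python sketch of the mechanism. It is not NLTK's implementation: the `LOOKUP` table, `suffix_backoff`, and `lookup_tag` are hypothetical names made up for illustration, and the suffix rule only stands in for a real statistical tagger. In NLTK itself, the book's chapter 5 builds the equivalent with `nltk.UnigramTagger(model=..., backoff=...)`.

```python
# Hypothetical hand-written lookup table. A real one would be derived from a
# tagged corpus or a lexicon rather than typed in by hand.
LOOKUP = {
    "a": "DT",
    "woman": "NN",
    "man": "NN",
    "fish": "NN",
    "bicycle": "NN",
    "like": "IN",
}


def suffix_backoff(token):
    """Toy stand-in for a statistical tagger: guess from the suffix only."""
    return "VBZ" if token.endswith("s") else "NN"


def lookup_tag(tokens, table, backoff):
    """Tag each token from the table; ask the backoff only for unknown words."""
    tagged = []
    for token in tokens:
        tag = table.get(token.lower())
        if tag is None:
            tag = backoff(token)
        tagged.append((token, tag))
    return tagged


s = "a woman needs a man like a fish needs a bicycle"
tagged = lookup_tag(s.split(), LOOKUP, suffix_backoff)
print(tagged)
```

Because "fish" is in the table, it is tagged before the backoff ever sees it; only "needs" falls through to the backoff here.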

Suzana
  • Can you give an example of the statistical tagger you refer to at the end of your answer? – Private Jul 06 '15 at 11:20
  • Most POS taggers in the NLTK make use of statistics of word / feature combinations. For example, [TNT](http://www.nltk.org/api/nltk.tag.html#nltk.tag.tnt.TnT) and [Naive Bayes](http://www.nltk.org/api/nltk.classify.html#nltk.classify.naivebayes.NaiveBayesClassifier). – Suzana Jul 08 '15 at 14:17
3

It's because you want the sentence "a woman needs a man like a fish needs a bicycle" to get POS tags that correspond to this "parse":

[ [[a woman] needs [a man]] like [[a fish] needs [a bicycle]] ]

but the default NLTK POS tagger isn't smart enough, and instead gave you POS tags that correspond to this parse:

[ [[a woman] needs [a man]] like [a fish needs] [a bicycle] ]

alvas
3

It depends on how the input is given to the POS tagger. Take the sentence "a woman needs a man like a fish needs a bicycle" as an example.

The tags differ depending on whether you use the default NLTK word tokenizer or a regex tokenizer.

import nltk
from nltk.tokenize import RegexpTokenizer

# Raw string so the regex backslashes are not treated as string escapes
TOKENIZER = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')

s = "a woman needs a man like a fish needs a bicycle"

# Tokenize the same sentence two ways
regex_tokenize = TOKENIZER.tokenize(s)
default_tokenize = nltk.word_tokenize(s)

# Tag each token list with the default tagger
regex_tag = nltk.pos_tag(regex_tokenize)
default_tag = nltk.pos_tag(default_tokenize)

print(regex_tag)
print("\n")
print(default_tag)

The output is as follows:

  Regex Tokenizer: 

[('a', 'DT'), (' ', 'NN'), ('woman', 'NN'), (' ', ':'), ('needs', 'NNS'), (' ', 'VBP'), ('a', 'DT'), (' ', 'NN'), ('man', 'NN'), (' ', ':'), ('like', 'IN'), (' ', 'NN'), ('a', 'DT'), (' ', 'NN'), ('fish', 'NN'), (' ', ':'), ('needs', 'VBZ'), (' ', ':'), ('a', 'DT'), (' ', 'NN'), ('bicycle', 'NN')]

 Default Tokenizer: 

[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]

With the regex tokenizer, fish is tagged as a noun, while with the default tokenizer it is tagged as an adjective. Depending on the tokenizer used, the tagger sees a different token sequence, so the tagging differs, which would in turn lead to a different parse tree structure.
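
One thing worth checking in the regex output is the `' '` tokens. Here is a small sketch using only the standard `re` module, on the assumption that `RegexpTokenizer` in its default mode returns the pattern's matches as tokens (which is consistent with the output above). It suggests the spaces come from the `\W+` alternative, which also matches whitespace. The second pattern, with `\w+` instead, is my own variant for comparison and not part of the original answer.

```python
import re

s = "a woman needs a man like a fish needs a bicycle"

# Pattern from the answer: \W+ matches any run of non-word characters,
# including whitespace, so every space comes back as its own token.
original_pattern = re.compile(r"(?u)\W+|\$[\d\.]+|\S+")
original_tokens = original_pattern.findall(s)

# Variant (an assumption, not from the answer): with \w+ as the first
# alternative, whitespace matches nothing and is skipped.
word_pattern = re.compile(r"(?u)\w+|\$[\d\.]+|\S+")
word_tokens = word_pattern.findall(s)

print(original_tokens[:5])
print(len(original_tokens), len(word_tokens))
print(word_tokens == s.split())
```

If that assumption holds, the regex run tags a sequence that contains extra whitespace tokens between the words, so the tagger sees a different context around "fish" than it does with the default tokenizer.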

Aravind Asok
2

If you use the Stanford POS tagger (3.5.1), the phrase is tagged correctly:

from nltk.tag.stanford import POSTagger
st = POSTagger("/.../stanford-postagger-full-2015-01-30/models/english-left3words-distsim.tagger",
               "/.../stanford-postagger-full-2015-01-30/stanford-postagger.jar")
st.tag("a woman needs a man like a fish needs a bicycle".split())

yields:

[('a', 'DT'),
 ('woman', 'NN'),
 ('needs', 'VBZ'),
 ('a', 'DT'),
 ('man', 'NN'),
 ('like', 'IN'),
 ('a', 'DT'),
 ('fish', 'NN'),
 ('needs', 'VBZ'),
 ('a', 'DT'),
 ('bicycle', 'NN')]
Raffael