Unigram tagging in NLTK

Question

Using NLTK's UnigramTagger, I am training on sentences from the Brown Corpus.

I try different categories and get about the same score each time. The value is roughly 0.93 for each category, such as fiction, romance, or humor.

import nltk
from nltk.corpus import brown


# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9415956079897209

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9348490474422324

Why is that the case? Is it because they are from the same corpus, or because their part-of-speech tagging is the same?

Could you please post your code so that we can try to reproduce this? — Rahul P, Mar 03 '20 at 03:49

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

It looks like you are training and then evaluating the trained UnigramTagger on the same training data. Take a look at the documentation of nltk.tag and specifically the part about evaluation.
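To see why this matters: a unigram tagger essentially remembers, for each word, the tag it carried most often in the training data. The sketch below is a simplified pure-Python stand-in for that idea, not NLTK's actual implementation; the names `train_lookup` and `accuracy` and the toy sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def train_lookup(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(lookup, tagged_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sent in tagged_sents:
        for word, gold in sent:
            total += 1
            # Words never seen in training map to None, which matches no tag.
            correct += lookup.get(word) == gold
    return correct / total if total else 0.0


# Toy data (hypothetical sentences with Brown-style tags).
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
held_out = [[("the", "AT"), ("bird", "NN"), ("sings", "VBZ")]]

lookup = train_lookup(train)
train_score = accuracy(lookup, train)        # every word was memorized
held_out_score = accuracy(lookup, held_out)  # only "the" was seen in training
print(train_score, held_out_score)
```

Scoring on the sentences the table was built from mostly measures memorization, so the number comes out high whatever the category is. The held-out score drops because unseen words get no tag at all.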

With your code you get a high score, which is expected because your training data and your evaluation/testing data are the same. If you evaluate on data that differs from the training data, you will get different results. My examples are below:

Category: Fiction

Here I have used the training set as brown.tagged_sents(categories='fiction')[:500] and the test/evaluation set as brown.tagged_sents(categories='fiction')[501:600]

from nltk.corpus import brown
import nltk

# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
brown_sents = brown.sents(categories='fiction')  # unused here; carried over from the question
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])

This gives you a score of ~ 0.7474610697359513

Category: Romance

Here I have used the training set as brown.tagged_sents(categories='romance')[:500] and the test/evaluation set as brown.tagged_sents(categories='romance')[501:600]

from nltk.corpus import brown
import nltk

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
brown_sents = brown.sents(categories='romance')  # unused here; carried over from the question
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])

This gives you a score of ~ 0.7046799354491662
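The same held-out check can be wrapped in a helper so that every category is scored the same way. This is only a sketch: `split_train_test` and `held_out_score` are hypothetical names, the 90/10 split is an arbitrary choice rather than the slices used above, and it assumes the Brown corpus has already been fetched with nltk.download('brown').

```python
def split_train_test(tagged_sents, train_fraction=0.9):
    """Split a sequence into disjoint train/test parts at a single cut point."""
    cut = int(len(tagged_sents) * train_fraction)
    return tagged_sents[:cut], tagged_sents[cut:]


def held_out_score(category, train_fraction=0.9):
    """Train a UnigramTagger on one part of a category, score it on the rest."""
    # Imported lazily so the split helper stays usable without NLTK installed.
    import nltk
    from nltk.corpus import brown

    tagged = brown.tagged_sents(categories=category)
    train, test = split_train_test(tagged, train_fraction)
    tagger = nltk.UnigramTagger(train)
    # Assumption: newer NLTK releases provide accuracy() and deprecate
    # evaluate(); fall back to evaluate() where accuracy() is missing.
    score = tagger.accuracy if hasattr(tagger, "accuracy") else tagger.evaluate
    return score(test)
```

Calling held_out_score('fiction'), held_out_score('romance') and held_out_score('humor') then gives per-category scores on sentences the tagger never saw. The exact values depend on the split and the NLTK version, so none are quoted here.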

I hope this helps and answers your question.
