
I want to evaluate different POS taggers in NLTK, using a text file as input.

As an example, I will take the unigram tagger. I have found how to evaluate a UnigramTagger using the Brown corpus:

from nltk.corpus import brown
import nltk

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
# We train a UnigramTagger by specifying tagged sentence data as a parameter
# when we initialize the tagger.
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))
print(unigram_tagger.evaluate(brown_tagged_sents))

It produces output like the following:

[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
0.9349006503968017

In a similar manner, I want to read text from a text file and evaluate the accuracy of different POS taggers.

I have figured out how to read a text file and how to apply POS tags to its tokens:

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')

# read the raw text directly from the file
with open(r"C:\pythonprojects\tagger_nlt\new-testing.txt") as f:
    sample_text = f.read()
tokens = nltk.word_tokenize(sample_text)

default_tagger = nltk.UnigramTagger(brown_tagged_sents)

print(default_tagger.tag(tokens))
[('Honestly', None), ('last', 'AP'), ('seven', 'CD'), ('lectures', None), ('are', 'BER'), ('good', 'JJ'), ('.', '.'), ('Lectures', None), ('are', 'BER'), ('understandable', 'JJ')]

What I want is a score like the one from default_tagger.evaluate(), so that I can compare different POS taggers in NLTK on the same input file and identify the tagger best suited to it.
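In other words, I would like to be able to do something like the hypothetical sketch below, where gold_sents would be a correctly tagged version of my file's sentences (which is exactly the part I am missing):

taggers = [nltk.UnigramTagger(brown_tagged_sents),
           nltk.BigramTagger(brown_tagged_sents)]
for tagger in taggers:
    # gold_sents: hypothetical list of hand-tagged sentences from my file
    print(tagger.evaluate(gold_sents))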

Any help will be appreciated.

Yash
    You need ground-truth tags for your test sentences. Either you use an existing set of tagged sentences (like the Brown corpus you used in the first example), or find some linguist with good knowledge of English who is willing to manually tag your sentences. – lenz Oct 12 '17 at 21:27
  • @Yash What you are trying to do is different than what you are doing now. You are passing the command `default_tagger.tag(tokens)` and it tags your raw tokens. You should provide manually tagged data in order to be able to evaluate the tagger. – Mohammed Oct 12 '17 at 21:27

2 Answers


This question is essentially a question about model evaluation metrics. In this case, our model is a POS tagger, specifically the UnigramTagger.

Quantifying

You want to know "how well" your tagger is doing. That is a qualitative question, so we use standard quantitative metrics to pin down what "how well" means: usually accuracy, precision, recall and F1-score.

Evaluating

First off, we need some data that is marked up with POS tags; then we can test. This is usually referred to as a train/test split: some of the data is used for training the POS tagger, and some is used for testing or evaluating its performance.

Since POS tagging is traditionally a supervised learning problem, we need some sentences with POS tags to train and test with.

In practice, people label a bunch of sentences and then split them into a training set and a test set. The NLTK book explains this well; let's try it out.

from nltk import UnigramTagger
from nltk.corpus import brown
# we'll use the brown corpus with universal tagset for readability
tagged_sentences = brown.tagged_sents(categories="news", tagset="universal")

# let's keep 20% of the data for testing, and 80% for training
i = int(len(tagged_sentences) * 0.2)
train_sentences = tagged_sentences[i:]
test_sentences = tagged_sentences[:i]

# let's train the tagger with our training sentences
unigram_tagger = UnigramTagger(train_sentences)
# now let's evaluate with our test sentences;
# the default evaluation metric for nltk taggers is accuracy
accuracy = unigram_tagger.evaluate(test_sentences)

print("Accuracy:", accuracy)
Accuracy: 0.8630364649525858
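Since the original goal is to compare different taggers on the same data, the same split can be reused to score a few other NLTK taggers side by side. A minimal sketch (the tagger choices here are just illustrative):

from nltk import DefaultTagger, BigramTagger

taggers = {
    # baseline: tag every token as a noun
    "default": DefaultTagger("NOUN"),
    "unigram": unigram_tagger,
    # bigram tagger that falls back on the unigram tagger
    # for contexts it has not seen in training
    "bigram+backoff": BigramTagger(train_sentences, backoff=unigram_tagger),
}

for name, tagger in taggers.items():
    print(name, tagger.evaluate(test_sentences))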

Now, accuracy is an OK metric for knowing "how many you got right", but there are other metrics that give us more detail, such as precision, recall and F1-score. We can use sklearn's classification_report to get a good overview of the results.

from sklearn import metrics

# strip off the gold tags and re-tag the test sentences
tagged_test_sentences = unigram_tagger.tag_sents(
    [[token for token, tag in sent] for sent in test_sentences])
# flatten gold and predicted tags into token-level lists;
# str() turns the None tag for unknown words into the label "None"
gold = [str(tag) for sentence in test_sentences for token, tag in sentence]
pred = [str(tag) for sentence in tagged_test_sentences for token, tag in sentence]
print(metrics.classification_report(gold, pred))

             precision    recall  f1-score   support

          .       1.00      1.00      1.00      2107
        ADJ       0.89      0.79      0.84      1341
        ADP       0.97      0.92      0.94      2621
        ADV       0.93      0.79      0.86       573
       CONJ       1.00      1.00      1.00       453
        DET       1.00      0.99      1.00      2456
       NOUN       0.96      0.76      0.85      6265
        NUM       0.99      0.85      0.92       379
       None       0.00      0.00      0.00         0
       PRON       1.00      0.96      0.98       502
        PRT       0.69      0.96      0.80       481
       VERB       0.96      0.83      0.89      3274
          X       0.10      0.17      0.12         6

avg / total       0.96      0.86      0.91     20458

Now we have some ideas and values we can use to quantify our taggers, but I am sure you are thinking, "That's all well and good, but how well does it perform on random sentences?"

Simply put, it is what was mentioned in the other answers: unless you have your own POS-tagged data for the sentences you want to test, you will never know for sure!

Nathan McCoy

You need manually tagged data, produced either by yourself or taken from other sources. Then follow the same procedure you used to evaluate the unigram tagger on the Brown corpus. You do not need to re-tag your manually tagged data. Suppose your new tagged data is saved in a variable named yash_new_test; then all you need to do is execute this command:

print(unigram_tagger.evaluate(yash_new_test))
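For reference, evaluate() expects the same structure as brown.tagged_sents(): a list of sentences, where each sentence is a list of (token, tag) tuples. A minimal sketch of what yash_new_test could look like, with purely illustrative hand-assigned tags from the Brown tagset:

# each sentence is a list of (token, tag) tuples;
# the tags here are illustrative, hand-assigned Brown-tagset labels
yash_new_test = [
    [('Honestly', 'RB'), ('the', 'AT'), ('last', 'AP'), ('seven', 'CD'),
     ('lectures', 'NNS'), ('are', 'BER'), ('good', 'JJ'), ('.', '.')],
]
print(unigram_tagger.evaluate(yash_new_test))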

I hope this helps!

Mohammed
  • I ran your suggestion and it gave me this error. tagged_sents = self.tag_sents(untag(sent) for sent in gold) ValueError: too many values to unpack (expected 2) – Yash Oct 13 '17 at 15:00
  • You are trying to unpack a dictionary in a wrong way. It is not related to my approach at all. – Mohammed Oct 14 '17 at 08:53