I want to evaluate different POS taggers in NLTK using a text file as input.
For example, take the UnigramTagger. I have found how to evaluate a UnigramTagger using the Brown corpus.
from nltk.corpus import brown
import nltk
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
# We train a UnigramTagger by specifying tagged sentence data as a parameter
# when we initialize the tagger.
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))
print(unigram_tagger.evaluate(brown_tagged_sents))
It produces output like the following:
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
0.9349006503968017
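For context, this is roughly how I picture comparing several taggers against the same Brown data (just a sketch; nltk.DefaultTagger and nltk.BigramTagger are only examples of other taggers I would want to test):

from nltk.corpus import brown
import nltk

brown_tagged_sents = brown.tagged_sents(categories='news')

# taggers to compare, all trained on (or backed by) the same Brown news data
taggers = {
    'default (NN)': nltk.DefaultTagger('NN'),
    'unigram': nltk.UnigramTagger(brown_tagged_sents),
    'bigram': nltk.BigramTagger(brown_tagged_sents),
}

for name, tagger in taggers.items():
    # evaluate() compares the tagger's output against the gold-standard tags
    print(name, tagger.evaluate(brown_tagged_sents))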
In a similar manner, I want to read text from a text file and evaluate the accuracy of different POS taggers on it.
I have figured out how to read a text file and how to apply POS tags to the tokens.
import nltk
from nltk.corpus import brown
from nltk.corpus import state_union
brown_tagged_sents = brown.tagged_sents(categories='news')
sample_text = state_union.raw(
r"C:\pythonprojects\tagger_nlt\new-testing.txt")
tokens = nltk.word_tokenize(sample_text)
# train a UnigramTagger on the Brown news sentences and tag the tokens from my file
default_tagger = nltk.UnigramTagger(brown_tagged_sents)
tagged_tokens = default_tagger.tag(tokens)
print(tagged_tokens)
[('Honestly', None), ('last', 'AP'), ('seven', 'CD'), ('lectures', None), ('are', 'BER'), ('good', 'JJ'), ('.', '.'), ('Lectures', None), ('are', 'BER'), ('understandable', 'JJ')
What I want is a score like the one returned by default_tagger.evaluate(), so that I can compare different POS taggers in NLTK on the same input file and identify the tagger best suited to that file.
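In other words, I am after something along these lines (just a rough sketch of what I have in mind; tagged_sents_from_my_file is hypothetical and is exactly the piece I am missing, since evaluate() needs gold-standard tagged sentences):

# hypothetical: somehow obtain gold-standard tagged sentences for my file
gold_sents = tagged_sents_from_my_file(r"C:\pythonprojects\tagger_nlt\new-testing.txt")

for name, tagger in [('unigram', nltk.UnigramTagger(brown_tagged_sents)),
                     ('bigram', nltk.BigramTagger(brown_tagged_sents))]:
    # the score I would like to compare across taggers for this particular file
    print(name, tagger.evaluate(gold_sents))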
Any help will be appreciated.