
I'm new to NLTK and still pretty new to python. I want to use my own dataset to train and test NLTK's Perceptron tagger. The training and testing data has the following format (it's just saved in a txt file):

Pierre  NNP
Vinken  NNP
,       ,
61      CD
years   NNS
old     JJ
,       ,
will    MD
join    VB
the     DT
board   NN
as      IN
a       DT
nonexecutive    JJ
director        NN
Nov.    NNP
29      CD
.       .

I want to call these functions on the data:

perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(train_data)
accuracy = perceptron_tagger.evaluate(test_data)

I've tried a few things but I just can't figure out what format the data is expected to be in. Any help would be appreciated! Thanks

ellen

1 Answer

The train() and evaluate() functions of the PerceptronTagger expect a list of tagged sentences: each sentence is a list of tuples, and each tuple is a (word, tag) pair of strings.

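In other words, the structure that train() expects looks like this (a hypothetical in-memory example, using tags from the question's sample):

```python
# The corpus is a list of sentences; each sentence is a list of
# (word, tag) tuples, where both word and tag are strings.
train_data = [
    [('Pierre', 'NNP'), ('Vinken', 'NNP'), ('will', 'MD'), ('join', 'VB'), ('.', '.')],
    [('The', 'DT'), ('board', 'NN'), ('meets', 'VBZ'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')],
]

# Every token is a 2-tuple of strings.
assert all(isinstance(pair, tuple) and len(pair) == 2
           for sent in train_data for pair in sent)
```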

Given train.txt and test.txt:

$ cat train.txt 
This foo
is  foo
a   foo
sentence    bar
.   .

That    foo
is  foo
another foo
sentence    bar
in  foo
conll   bar
format  bar
.   .

$ cat test.txt 
What    foo
is  foo
this    foo
sentence    bar
?   ?

How foo
about   foo
that    foo
sentence    bar
?   ?

Read the files in CoNLL format into list of tuples.

# Using https://github.com/alvations/lazyme
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]

# Or otherwise

>>> def per_section(it, is_delimiter=lambda x: x.isspace()):
...     """
...     From http://stackoverflow.com/a/25226944/610569
...     """
...     ret = []
...     for line in it:
...         if is_delimiter(line):
...             if ret:
...                 yield ret  # OR  ''.join(ret)
...                 ret = []
...         else:
...             ret.append(line.rstrip())  # OR  ret.append(line)
...     if ret:
...         yield ret
... 
>>> 
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> tagged_test_sentences
[[('What', 'foo'), ('is', 'foo'), ('this', 'foo'), ('sentence', 'bar'), ('?', '?')], [('How', 'foo'), ('about', 'foo'), ('that', 'foo'), ('sentence', 'bar'), ('?', '?')]]
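As a quick sanity check (my own toy string, not the files above), per_section splits on blank lines, so a tab-separated CoNLL-style block parses like this:

```python
import io

def per_section(it, is_delimiter=lambda x: x.isspace()):
    # Group lines into sections separated by blank (whitespace-only) lines.
    ret = []
    for line in it:
        if is_delimiter(line):
            if ret:
                yield ret
                ret = []
        else:
            ret.append(line.rstrip())  # rstrip() drops the newline, keeps the inner tab
    if ret:
        yield ret

conll = "This\tfoo\nis\tfoo\n.\t.\n\nThat\tfoo\n.\t.\n"
sents = [[tuple(tok.split('\t')) for tok in sent]
         for sent in per_section(io.StringIO(conll))]
# sents == [[('This', 'foo'), ('is', 'foo'), ('.', '.')],
#           [('That', 'foo'), ('.', '.')]]
```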

Now you can train/evaluate the tagger:

>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]
>>> from nltk.tag.perceptron import PerceptronTagger
>>> pct = PerceptronTagger(load=False)
>>> pct.train(tagged_train_sentences)
>>> pct.tag('Where do I find a foo bar sentence ?'.split())
[('Where', 'foo'), ('do', 'foo'), ('I', '.'), ('find', 'foo'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'foo'), ('sentence', 'bar'), ('?', '.')]
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> pct.evaluate(tagged_test_sentences)
0.8
alvas