
I am new to the nltk library and I am trying to teach my classifier some labels using my own corpus.

For this I have a file with IOB tags like this:

How O 
do B-MYTag
you I-MYTag
know O
, O
where B-MYTag
to O
park O
? O

I do this by:

self.classifier = nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)

and it works.
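Roughly, I build train_set from the IOB file like this (a simplified sketch; the file name and the token_features helper are just illustrative, and my real feature function is richer):

import nltk

def token_features(token):
    # Deliberately minimal per-token features, for illustration only.
    return {
        'word': token.lower(),
        'is_title': token.istitle(),
        'suffix2': token[-2:],
    }

train_set = []
with open('my_corpus.iob') as f:  # assumed file name
    for line in f:
        line = line.strip()
        if not line:
            continue  # blank lines separate sentences
        token, label = line.split()
        train_set.append((token_features(token), label))

classifier = nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)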

How can I train my classifier with negative cases?

I would have a similar file with IOB tags, and I would specify that this file is labeled wrong (negative weights).

How can I do this?

An example of a negative case would be:

How B-MYTag 
do O
you O
know O
, O
where B-MYTag
to O
park O
? O

After that, I expect the classifier to remember that How is probably not a MYTag... The reason for this is to make the classifier learn faster.

If I could just type in statements, the program would process them and at the end ask me whether I am satisfied with the result. If I am, the text would be added to the train_set; if not, it would be added to the negative_train_set.

This way, it would be easier and faster to teach the classifier the right stuff.

Marko Zadravec
    Can you give an example of the negative cases? I doubt this works (conceptually) with sequence tagging. I mean, what do you expect to learn from wrong annotations? The positive (B/I) and negative (O) classes are already represented in the given annotations. – lenz Feb 09 '17 at 23:32
  • Your edit sounds like you're after an active-learning workflow. Of course you can do that manually: let the classifier predict something, correct the labels manually, add it to the training set, retrain (a rough sketch of this loop appears after these comments). You have to be specific about the correction: if you only say "the labels in this sentence are wrong", how should the classifier know that the first three tags are bad, but e.g. the fourth (`O` on "know") is correct? – lenz Feb 10 '17 at 10:24
  • Please note that if you have a large training set to begin with and then add a few manually corrected examples as I just suggested, the impact might be very small. – lenz Feb 10 '17 at 10:27
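The manual loop lenz describes might look roughly like this sketch; it reuses the hypothetical token_features helper from the question, and correct_by_hand is a placeholder for the manual review step:

def active_learning_round(classifier, new_sentences, train_set):
    # new_sentences: list of token lists to be labeled and reviewed.
    for sent in new_sentences:
        predicted = [(tok, classifier.classify(token_features(tok)))
                     for tok in sent]
        corrected = correct_by_hand(predicted)  # hypothetical: you fix the labels
        train_set.extend((token_features(tok), label)
                         for tok, label in corrected)
    # Retrain from scratch on the enlarged training set.
    return nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)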

1 Answer


I'm guessing that you tried a classifier, saw some errors in the results, and want to feed back the wrong outputs as additional training input. There are learning algorithms that optimize on the basis of which answers are wrong or right (neural nets, Brill rules), but the MaxEnt classifier is not one of them. Classifiers that do work like this do all the work internally: They tag the training data, compare the result to the gold standard, adjust their weights or rules accordingly, and repeat again and again.

In short: you can't use incorrect outputs as a training dataset. The idea doesn't even fit the machine learning model, since training data is assumed correct, so incorrect inputs have probability zero. Focus on improving your classifier by using better features, more data, or a different engine.
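To make "better features" concrete: a common first step is adding context from the neighboring tokens. This is only an illustration (the function name and feature names are made up, not nltk API):

def context_features(sent, i):
    # sent is a list of tokens, i the index of the token to tag.
    token = sent[i]
    return {
        'word': token.lower(),
        'suffix3': token[-3:],
        'is_title': token.istitle(),
        'prev_word': sent[i-1].lower() if i > 0 else '<START>',
        'next_word': sent[i+1].lower() if i < len(sent) - 1 else '<END>',
    }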

alexis
  • Thank you. After some thinking, I see that MaxEnt really can't work with a wrong dataset, because it can't calculate entropy from "wrong" data. Thank you for your answer. Right now I store all new "statements" in a file with IOB tags, correct them by hand and insert them into the training corpus. You wrote something about neural nets. Are they usable with nltk? I have no knowledge about them... do you have some good links to check them out? – Marko Zadravec Feb 10 '17 at 19:28
  • nltk's default POS tagger is `PerceptronTagger()`. See [here](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python). – alexis Feb 10 '17 at 21:21
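`PerceptronTagger` can also be trained on your own (token, tag) sentences; a minimal sketch, with illustrative data and file name:

from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger(load=False)  # start from an empty model
train_sents = [
    [('How', 'O'), ('do', 'B-MYTag'), ('you', 'I-MYTag'), ('know', 'O')],
    # ... more sentences from your corpus
]
tagger.train(train_sents, save_loc='my_tagger.pickle', nr_iter=5)
print(tagger.tag(['where', 'to', 'park']))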