I have two classes of sentences. Each has reasonably distinct pos-tag sequence. How can I train a Naive-Bayes classifier with POS-Tag sequence as a feature? Does Stanford CoreNLP/NLTK (Java or Python) provide any method for building a classifier with pos-tag as a feature? I know in python NaiveBayesClassifier
allows for building a NB classifier but it uses contains-a-word
as feature but can it be extended to use pos-tag-sequence as a feature ?
Asked
Active
Viewed 3,394 times
6

kundan
- 1,278
- 14
- 27
-
Do you really need to use NaiveBayesClassifier? Have you looked at CRF? By the way, have you read this chapter: http://www.nltk.org/book/ch06.html ? – 404pio Feb 27 '15 at 16:13
-
Thanks for the link. I ended up using concatenated pos-tags-sequence and `containsPosSequence` as a feature... – kundan Mar 04 '15 at 11:00
1 Answers
6
If you know how to train and predict texts (or sentences in your case) using nltk's naive bayes classifier and words as features, than you can easily extend this approach in order to classify texts by pos-tags. This is because the classifier don't care about whether your feature-strings are words or tags. So you can simply replace the words of your sentences by pos-tags using for example nltk's standard pos tagger:
sent = ['So', 'they', 'have', 'internet', 'on', 'computers' , 'now']
tags = [t for w, t in nltk.pos_tag(sent)]
print tags
['IN', 'PRP', 'VBP', 'JJ', 'IN', 'NNS', 'RB']
As from now you can proceed with the "contains-a-word" approach.

char bugs
- 419
- 2
- 8
-
Adding to your answer, as the question says "sequence", we can chain the pos tags of the sentence like `[IN][PRP][VBP][JJ][IN][NNS][RS]` and define a feature like say `conatinsPrpVbpSequence` and set it to `True` for an occurance of `[PRP][VBP]`.... – kundan Mar 05 '15 at 06:22