Scikit Naive Bayes Classification for text

Question

I am trying to use scikit for Naive Bayes classification. I have a couple of questions (I am also new to scikit).

1) Scikit algorithms expect the input as a NumPy array and the labels as arrays. For text classification, should I map each word to a number (ID) by maintaining a hash of the words in the vocabulary, each with a unique ID? Is this standard practice in scikit?

2) If the same text is assigned to more than one class, how should I proceed? One obvious way is to replicate each training example, once for each associated label. Is there a better representation?

3) Similarly, for the test data, how do I get more than one class associated with a test example?

I am using http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html as my base.

score 1 · Accepted Answer · answered Dec 07 '13 at 23:20

1) Yes. Use DictVectorizer or HashingVectorizer from the feature_extraction module.

2) This is a multilabel problem. You could use OneVsRestClassifier from the multiclass module, which trains a separate classifier for each class.

3) Using a multilabel classifier (one classifier per class) will do that: each per-class classifier decides independently, so a test document can receive more than one label.

Take a look at http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html and http://scikit-learn.org/dev/auto_examples/plot_multilabel.html
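As a rough illustration of points 1) to 3), here is a minimal sketch, assuming a recent scikit-learn release. The toy texts and labels are invented, and CountVectorizer and MultiLabelBinarizer are my own choices for the example rather than something named in the answer. The vectorizer builds the word-to-ID mapping itself, so no hand-maintained hash is needed, and each document keeps its full list of labels instead of being replicated.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration only.
texts = [
    "cheap flights to paris",
    "paris hotel deals",
    "python numpy array tutorial",
    "numpy tutorial for flights data",
]
# One list of labels per document; no need to duplicate documents.
labels = [["travel"], ["travel", "deals"], ["code"], ["code", "travel"]]

# Turn the label lists into a binary indicator matrix (one column per label).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# CountVectorizer maps words to column indices; OneVsRestClassifier
# fits one MultinomialNB per label column.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(texts, Y)

# Predict an indicator row for new text, then map it back to label names.
pred = model.predict(["paris flights"])
predicted_labels = mlb.inverse_transform(pred)

# The learned word -> column index mapping.
vocabulary = model.named_steps["countvectorizer"].vocabulary_
print(predicted_labels)
```

`predicted_labels` holds one tuple per test document, and that tuple can contain zero, one, or several label names.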

Andreas Mueller


Thanks for the nice answer. I have a few questions about the strategy. If I use a one-vs-all approach (OneVsRestClassifier), wouldn't it treat a permutation of the same labels as a different class, when it should be the same? For example, say one text has the labels X, Y, Z and a very similar text has the labels Z, Y, A. Would the classifier wrongly conclude that the first text belongs to X and the second to Z, even though the order of labels should not matter? Would sorting the labels help? – David Dec 08 '13 at 08:57
Scikit-learn will take care of that for you, but I would recommend applying LabelBinarizer to your labels, which will create a unique representation. – Andreas Mueller Dec 09 '13 at 00:46
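
To see why label order does not matter, here is a small sketch using the X, Y, Z and Z, Y, A example from the comment above. It assumes a recent scikit-learn release, where MultiLabelBinarizer is the class that handles lists of labels; that is a substitution on my part for the LabelBinarizer named in the comment.

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Two documents whose labels are given in different orders.
Y = mlb.fit_transform([["X", "Y", "Z"], ["Z", "Y", "A"]])

# Columns follow the sorted label names: A, X, Y, Z.
print(mlb.classes_)
# Each row marks which labels are present, regardless of input order.
print(Y)

# A permutation of the first document's labels gives the same row.
same = mlb.transform([["Z", "Y", "X"]])
```

Each label gets its own fixed column, so X, Y, Z and Z, Y, X produce identical rows and the one-vs-rest classifiers never see the order in which labels were listed.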
