
I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class, and "0.0" for all the rest.

I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?! The classifier can mistake finance-investing or chords-strings, but the predict_proba() output shows no sign of hesitation...

A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction, without, for a start, restricting the vocabulary with stop_words or min/max_df, so I have been getting very large vectors.
- I've been training the classifier on a hierarchical category tree (shallow: not more than 3 layers deep) with 7 texts (manually categorized) per category. It is, for now, flat training: I am not taking the hierarchy into account.
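
As a rough illustration of the vocabulary restriction mentioned in the first point, the sketch below compares an unrestricted TfidfVectorizer with one that sets stop_words, min_df and max_df. The toy corpus and the threshold values are made up for the example; suitable values for the real TextsList would have to be found by experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for TextsList from the question.
texts = [
    "the bank raised the interest rate",
    "the fund invested in the bond market",
    "the guitar chords and the strings",
    "the violin strings and the bow",
]

# Default settings: every token of two or more characters becomes a feature.
full = TfidfVectorizer()

# Restricted settings (illustrative values only):
#   stop_words="english" drops common function words such as "the",
#   min_df=2 drops terms that occur in fewer than 2 documents,
#   max_df=0.9 drops terms that occur in more than 90% of documents.
restricted = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.9)

n_full = full.fit_transform(texts).shape[1]
n_restricted = restricted.fit_transform(texts).shape[1]

print("features without restriction:", n_full)
print("features with restriction:   ", n_restricted)
print("kept terms:", sorted(restricted.vocabulary_))
```

On a corpus this small the restriction removes almost everything; on a few thousand texts the aim would be to drop rare and ubiquitous terms while keeping the discriminative ones.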

The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?

Here's the code I'm using:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl') 
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 category labels, one per text
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl') 
...

#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba(Vec.toarray())[0]
MaxProbIndex = np.argmax(Probs)  # index of the most probable class
MaxProb = Probs[MaxProbIndex]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)  

Update:
Following the advice below, I tried MultinomialNB & LogisticRegression. They both return varying probabilities, and are better in every way for my task: much more accurate classification, smaller objects in memory & much better speed (MultinomialNB is lightning fast!).

I now have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012. This is for the predicted/winning category (and the classification is accurate).
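
One way to put these values in perspective is to compare them with the probability a classifier would assign if it spread its mass evenly over all categories. The sketch below assumes roughly 2000 / 7 ≈ 285 flat categories; that count is inferred from the figures quoted above and is not stated explicitly anywhere in the question.

```python
# Figures taken from the question; the category count is an inference.
n_texts = 2000            # approximate number of training texts
texts_per_category = 7    # manually categorized texts per category
n_classes = n_texts // texts_per_category

# Probability of each class if the classifier were completely undecided.
uniform = 1.0 / n_classes

print("assumed number of classes:", n_classes)
print("uniform baseline: %.4f" % uniform)
for p in (0.004, 0.012):
    print("p = %.3f is %.2f times the uniform baseline" % (p, p / uniform))
```

Under this assumption the reported winning probabilities sit only slightly above the uniform baseline, which would mean the predicted distribution is very flat rather than extreme.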

AviM

1 Answer


"...the probability outputs from predict_proba are not to be taken too seriously"

I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
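
A minimal numerical sketch of why this saturation happens: naive Bayes adds up one log-likelihood per feature, so even a tiny average advantage per feature grows into a huge gap between classes once there are many features. The means, the spread and the feature counts below are arbitrary values chosen for illustration; this is not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior(n_features):
    """Two-class posterior from summed per-feature log-likelihoods."""
    # Class 0 is favoured by a small average margin (0.1) on each feature.
    ll0 = rng.normal(loc=-1.0, scale=0.5, size=n_features)
    ll1 = rng.normal(loc=-1.1, scale=0.5, size=n_features)
    # The independence assumption means the log-likelihoods are simply added.
    joint = np.array([ll0.sum(), ll1.sum()])
    joint -= joint.max()          # stabilise before exponentiating
    probs = np.exp(joint)
    return probs / probs.sum()    # normalise to a probability vector

for d in (10, 1000, 100000):
    print(d, posterior(d))
```

With tens of thousands of TF-IDF features, as in the question, the summed gap is typically so large that the normalised output rounds to exactly 1.0 and 0.0.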

"The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text."

That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression, which are much faster at predict time and also smaller. Their assumptions with respect to the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
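
A minimal sketch of that suggestion, assuming a made-up four-document corpus and two hypothetical labels in place of TextsList and TargetList: both estimators are fitted directly on the sparse TF-IDF matrix, so no toarray() call is needed.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for TextsList and TargetList.
texts = [
    "the bank raised the interest rate",
    "the fund invested in the bond market",
    "the guitar chords and the strings",
    "the violin strings and the bow",
]
labels = ["finance", "finance", "music", "music"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # stays a scipy sparse matrix
query = vectorizer.transform(["bond market rate"])

results = {}
for clf in (MultinomialNB(), LogisticRegression()):
    clf.fit(X, labels)                        # sparse input is accepted
    probs = clf.predict_proba(query)[0]
    best = probs.argmax()                     # index of the winning class
    results[type(clf).__name__] = (clf.classes_[best], probs[best])

print("sparse input:", issparse(X))
for name, (category, prob) in results.items():
    print(name, category, round(float(prob), 3))
```

The same pattern applies to BernoulliNB, which also accepts sparse input.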

Fred Foo
  • Thanks for the swift & very helpful answer. I followed your advice, and here's a small follow-up: – AviM Aug 12 '13 at 13:41
  • ... both **LogisticRegression** and **MultinomialNB** returned varying probabilities, almost in perfect agreement with each other, though the numbers were really small: typically in the range 0.004-0.012; **SGDClassifier** can return probabilities only for binary estimates; all three were indeed faster (**MultinomialNB** - extremely fast), smaller & more accurate than **GaussianNB**. Two questions: 1. How do I understand these very low probability values? 2. Is there a tool inside _scikit-learn_ for dimensionality reduction, or should I 'play' with _min/max_df_ while monitoring the _score()_? – AviM Aug 12 '13 at 15:46
  • @AviM: if you upgrade to 0.14, then `SGDClassifier` does multiclass probabilities. Ad 1., if LR gives extreme probabilities, then either your classes are very clear-cut, or you need to regularize more. Ad 2., there are many options for dimensionality reduction. For text, there's `TruncatedSVD`, but that won't play nicely with `MultinomialNB`. You can also try feature selection, see the document classification example in the `examples` directory. – Fred Foo Aug 12 '13 at 16:21
  • Sorry for maybe not being so clear: the "low value" probabilities that I was quoting, are the probabilities for the predicted/winning categories, those which came first; and the classification is rather good now. This is why I had been surprised by the low value. Thanks again. – AviM Aug 14 '13 at 12:02
  • I will rephrase it as a question: What do I make of the very low probabilities I've been getting (in both `MultinomialNB` & `LogisticRegression`)? Can I take them seriously? Is it that my categories are 'too close'? Note: the classification is rather accurate. – AviM Aug 28 '13 at 15:48
  • @AviM: how could they be too close? Extreme probabilities rather indicate that your samples are far away from the decision boundary. – Fred Foo Aug 28 '13 at 15:52
  • Too close to one another, I mean. What do you make of `max(Classifier.predict_proba(Vec)[0])` giving values like 0.001? (`Classifier = MultinomialNB()` / `LogisticRegression()` ) – AviM Aug 29 '13 at 17:12
  • @AviM: you may have an easy problem, or you don't regularize enough. Try tuning `alpha` (NB) or `C` (LR). – Fred Foo Aug 29 '13 at 19:24
  • Thanks, I will try. By the way, I wrongly quoted the number: it is around 0.01, not 0.001. – AviM Aug 31 '13 at 01:38
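
The tuning suggested in the last comments can be sketched as follows. The corpus, the labels and the grids of `alpha` and `C` values are made up for illustration. On this toy data, stronger smoothing (large `alpha`) or stronger regularization (small `C`) flattens the predicted probabilities, while the opposite settings sharpen them; which direction is appropriate for real data would have to be judged on held-out texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not the asker's corpus.
texts = [
    "the bank raised the interest rate",
    "the fund invested in the bond market",
    "the guitar chords and the strings",
    "the violin strings and the bow",
]
labels = ["finance", "finance", "music", "music"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
query = vectorizer.transform(["bond market rate"])

# MultinomialNB: alpha is the additive smoothing parameter.
nb_max = {}
for alpha in (0.01, 1.0, 100.0):
    nb = MultinomialNB(alpha=alpha).fit(X, labels)
    nb_max[alpha] = nb.predict_proba(query).max()

# LogisticRegression: C is the inverse regularization strength.
lr_max = {}
for C in (0.01, 1.0, 100.0):
    lr = LogisticRegression(C=C).fit(X, labels)
    lr_max[C] = lr.predict_proba(query).max()

print("MultinomialNB, winning probability by alpha:", nb_max)
print("LogisticRegression, winning probability by C:", lr_max)
```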