MultinomialNB from scikit-learn is giving unexpected outputs for some simple use cases.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

docs = ['aa', 'aa', '', '', 'aa']
y = np.array([1, 1, 1, 1, 2])

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # vocabulary is just {'aa'}
clf = MultinomialNB(alpha=1e-10).fit(X, y)  # near-zero smoothing
X_test = vec.transform(['aa'])
print(clf.predict_proba(X_test)[0, 0])      # probability of class 1
# >> 0.8

I would expect 0.4 here, since I understand P(y|X) to be approximated in this case as P(x='aa'|y=1) * P(y=1) = 2/4 * 4/5 = 0.4, with the denominator P(X) ignored. The 2/4 comes from 2 of the 4 class-1 documents containing 'aa'.
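
To make that arithmetic explicit, here is the hand calculation behind my expected 0.4 (plain Python, independent of sklearn; the variable names are mine, just for illustration):

p_aa_given_y1 = 2 / 4  # 2 of the 4 class-1 docs contain 'aa'
p_y1 = 4 / 5           # 4 of the 5 docs have label 1
print(p_aa_given_y1 * p_y1)  # 0.4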

Now, the documentation and this Stack Overflow discussion make me think the calculation is really P(x='aa' | all words in y=1 samples) * P(y=1) = 2/2 * 4/5 = 0.8, where 2/2 reflects that 'aa' accounts for 2 of the 2 total word occurrences across the class-1 documents.
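
Spelled out the same way (again just my arithmetic, not sklearn's internals), that reading would be:

p_aa_given_y1 = 2 / 2  # 'aa' accounts for 2 of the 2 word occurrences in class-1 docs
p_y1 = 4 / 5
print(p_aa_given_y1 * p_y1)  # 0.8, matching the output above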

However, this reading doesn't hold up if you run the same example as above but with docs = ['aa','aa','bb','bb','aa'], which outputs 0.6666667.

In this new case, I would expect both formulas to give (2/4 * 2/4) * 4/5 = 0.2. I have (2/4 * 2/4) because the P(X|y) term, under the independence assumption, should be the product of the probability of having "aa" and the probability of not having "bb" when y=1.
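
For completeness, here is the rerun with the new docs next to the hand calculation I would expect (this reuses y and the imports from the snippet above):

vec2 = CountVectorizer()
X2 = vec2.fit_transform(['aa', 'aa', 'bb', 'bb', 'aa'])
clf2 = MultinomialNB(alpha=1e-10).fit(X2, y)
print(clf2.predict_proba(vec2.transform(['aa']))[0, 0])  # 0.6666667

# my expectation: P(has 'aa'|y=1) * P(no 'bb'|y=1) * P(y=1)
print((2 / 4) * (2 / 4) * (4 / 5))  # 0.2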

Any clarification is appreciated.
