MultinomialNB from scikit-learn is giving unexpected outputs for some simple use cases.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
docs = ['aa','aa','','','aa']
y = np.array([1,1,1,1,2])
vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB(alpha=1e-10).fit(X, y)
X_test = vec.transform(['aa'])
print(clf.predict_proba(X_test)[0, 0])
# >> 0.8
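To see where that 0.8 comes from, it helps to inspect what the fitted model actually estimated. This is a diagnostic sketch (repeating the setup above so it is self-contained); class_log_prior_ and feature_log_prob_ are real MultinomialNB attributes, and the row order follows clf.classes_:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

docs = ['aa', 'aa', '', '', 'aa']
y = np.array([1, 1, 1, 1, 2])

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB(alpha=1e-10).fit(X, y)

print(clf.classes_)                  # [1 2]
# Class priors P(y): fraction of training samples per class.
print(np.exp(clf.class_log_prior_))  # [0.8 0.2]
# Per-class word probabilities P(word | y), one row per class
# (word count in that class's docs / total word count in that class's docs).
print(np.exp(clf.feature_log_prob_))
```

With the near-zero alpha, P('aa' | y=1) comes out as 2/2 = 1 (two occurrences of 'aa' out of two total words in class-1 docs) rather than 2/4.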
I would expect 0.4, since I understand P(y|X) to be approximated here as P(x='aa'|y=1) * P(y=1) = 2/4 * 4/5 = 0.4, with P(X) ignored.
Now, the documentation and this Stack Overflow discussion make me think the calculation is really P(x='aa' | all words in y=1 samples) * P(y=1) = 2/2 * 4/5 = 0.8.
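Spelling that second formula out numerically, with normalization over both classes (my own arithmetic, not scikit-learn internals):

```python
# Hand computation for the first example, using per-class word fractions.
p_y1, p_y2 = 4 / 5, 1 / 5      # class priors from the label counts
p_aa_given_1 = 2 / 2           # 'aa' occurrences / total words in class-1 docs
p_aa_given_2 = 1 / 1           # same for class 2

score1 = p_aa_given_1 * p_y1   # 0.8
score2 = p_aa_given_2 * p_y2   # 0.2
print(score1 / (score1 + score2))  # 0.8
```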
However, this doesn't work if you run the same example above with docs = ['aa','aa','bb','bb','aa'], which gives 0.6666667 as the output.
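For concreteness, the modified run (identical to the first snippet except for docs):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

docs = ['aa', 'aa', 'bb', 'bb', 'aa']
y = np.array([1, 1, 1, 1, 2])

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB(alpha=1e-10).fit(X, y)

X_test = vec.transform(['aa'])
print(clf.predict_proba(X_test)[0, 0])
# >> 0.6666667 (approximately)
```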
In this new case, I would expect both formulas to give (2/4 * 2/4) * 4/5 = 0.2. I use (2/4 * 2/4) because the P(X|y) term, under the independence assumption, is the product of the probability of having "aa" and the probability of not having "bb" when y=1.
Any clarification is appreciated.