
I have a question about the Naïve Bayes classifier when the class distributions of the training data and the test data are skewed differently:

  • training data has 90% spam and 10% non-spam
  • test data has 80% non-spam and 20% spam

Would it be better to use MLE (maximum likelihood) rather than MAP (the standard maximum posterior probability) as the decision function on this training data?

My understanding is that since the distribution of the training data differs from that of the test data, using maximum posterior probabilities will bias the test results towards the spam class, so MLE is better. Is my understanding correct?
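To make the two decision rules concrete, here is a minimal sketch (the class names and log-likelihood values are illustrative, not from the question): MAP adds the log-prior estimated from the skewed training data, while MLE drops it, which is equivalent to assuming a uniform prior.

```python
import numpy as np

# Hypothetical log-likelihoods log P(x | y) for one test document,
# as would come from a trained Naive Bayes model ("ham" = non-spam).
# Values are chosen so the two decision rules disagree.
log_likelihood = {"spam": -44.0, "ham": -43.0}

# Log-priors estimated from the skewed training data:
# 90% spam, 10% non-spam.
log_prior = {"spam": np.log(0.9), "ham": np.log(0.1)}

# MAP decision: argmax_y [ log P(y) + log P(x | y) ]
map_class = max(log_likelihood, key=lambda y: log_prior[y] + log_likelihood[y])

# MLE decision: argmax_y log P(x | y)  (prior dropped)
mle_class = max(log_likelihood, key=lambda y: log_likelihood[y])

print("MAP:", map_class)  # -> spam: the skewed training prior dominates
print("MLE:", mle_class)  # -> ham: likelihood alone prefers non-spam
```

In this toy case the 9:1 training prior flips the MAP decision to spam even though the likelihood slightly favors non-spam, which is exactly the bias the question describes.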

Shraddha
  • In practice, it seems very common to ignore any prior probability, and also most terms, and use only the 10 strongest positive and negative signals each. Not because of theory, but because it works better. – Has QUIT--Anony-Mousse Mar 06 '17 at 07:29
  • Don't do as @Anony-Mousse suggests. It is certainly not my experience with text classification that you can ignore priors and keep only a few signals (on the contrary, except for trivial tasks, you typically need thousands of features). I think the question is: which set better reflects reality, the training set or the test set? Is there any reason why you can't have both sets reflect the real distributions that you will encounter when using the classifier? Because a potential problem I see is incorrect data sampling, which has consequences beyond prior calculation. – Pascal Soucy Mar 08 '17 at 20:52

0 Answers