
Given that Bayes' formula is:

P(A|B) = (P(B|A) * P(A)) / P(B)

Let's say that I want to train a classifier to classify spam/ham. Let's also say that, in the real world, we get about 1% spam, so for a given sample of input we would expect about 1% spam.

When I am training my classifier, should I train it on documents that contain only 1% spam, or is it OK to train it on a much larger percentage of spam than I would expect to find in the real world?

I ask this because, if I have a much larger percentage of spam, then the value of

P(A)

(the prior probability of spam) is going to be abnormally large. Will this throw off my classifier, and in that case would it classify some "ham" documents as "spam"?
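
For concreteness, here is a minimal sketch (the likelihood values are hypothetical, not from real data) showing how the prior taken from the training mix shifts the posterior for the same observation:

    # Hedged sketch: the likelihood values below are made up for illustration.
    # P(spam|word) = P(word|spam) * P(spam) / P(word), where
    # P(word)      = P(word|spam) * P(spam) + P(word|ham) * P(ham)

    def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
        p_ham = 1.0 - p_spam
        p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
        return p_word_given_spam * p_spam / p_word

    p_w_spam, p_w_ham = 0.20, 0.05  # hypothetical per-class likelihoods of one word

    print(posterior_spam(p_w_spam, p_w_ham, 0.01))  # realistic 1% spam prior  -> ~0.04
    print(posterior_spam(p_w_spam, p_w_ham, 0.50))  # 50/50 training mix prior -> 0.80

With identical likelihoods, the inflated prior pushes the posterior from roughly 4% to 80%, which is exactly the kind of shift I am worried about.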

user98651

1 Answer


To train a Bayesian estimator, you need to learn the class-conditional densities P(X|H) and P(X|S), where X is your current observation and H, S stand for the ham and spam classes. Each one is trained only from examples of its own class, i.e., P(X|H) is learned only from ham samples and P(X|S) is learned only from spam samples. Up to this point it does not matter much whether the ratio of spam to ham samples reflects reality. However, to obtain a proper Bayesian estimate you also need to estimate the priors P(H) and P(S), and those should capture the proportion of spam/ham in reality.
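
As a rough sketch of separating the two steps (this assumes scikit-learn; the documents and numbers are made up for illustration), the likelihoods P(X|H) and P(X|S) are fit from whatever labelled samples you have, while class_prior overrides the priors that would otherwise be estimated from the unbalanced training mix:

    # Sketch only: assumes scikit-learn is installed; texts and priors are illustrative.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = [
        "cheap pills buy now",     # spam
        "win a free prize today",  # spam
        "meeting moved to 3pm",    # ham
        "lunch tomorrow?",         # ham
    ]
    train_labels = [1, 1, 0, 0]    # 1 = spam, 0 = ham; a 50/50 mix, unlike reality

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_texts)

    # classes_ will be [0, 1], so the priors are given as [P(H), P(S)] = [0.99, 0.01]
    clf = MultinomialNB(class_prior=[0.99, 0.01])
    clf.fit(X, train_labels)

    test = vectorizer.transform(["free prize meeting"])
    print(clf.predict_proba(test))  # posterior now reflects the 1% spam prior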

lukas