I am implementing a Naive Bayes classifier for text category detection. I have 37 categories and get about 36% accuracy on my test set.
I want to improve accuracy, so I decided to implement 37 two-way classifiers, as suggested in many sources (Ways to improve the accuracy of a Naive Bayes Classifier? is one of them). For a given text, each classifier would answer:
specific_category OR everything_else
and I would determine the text's category by applying them sequentially.
But I have a problem with the first classifier: it always falls into the "specific_category" class.
My training data has 37 categories with 100 documents per category, all documents of the same size. For each category I selected a list of 50 features by the mutual information criterion (features are just words).
For the sake of example, I use two categories: "agriculture" and "everything_else" (everything except agriculture).
For category "agriculture":
Number of words in all documents of this class (the first term in the denominator of (13.7) in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf): W_agriculture = 31649.
Size of vocabulary V_agriculture = 6951.
Log probability of an unknown word (UNK): log P(UNK|agriculture) = -10.56
Log prior of the class: log P(agriculture) = log(1/37) = -3.61 (we have 37 categories of same-size documents)
For category "everything_else":
W_everything_else = 1030043
V_everything_else = 44221
log P(UNK|everything_else) = -13.89
log P(everything_else) = log(36/37) = -0.03
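The UNK values above are consistent with add-one smoothing applied with a per-class vocabulary, i.e. log(1 / (W + V)). The sketch below reproduces the reported numbers under that assumption; the function names `log_unk` and `log_prior` are made up for illustration and are not from my actual implementation.

```python
import math

def log_unk(total_words: int, vocab_size: int) -> float:
    """Log probability of an unseen word, assuming add-one smoothing: log(1 / (W + V))."""
    return math.log(1.0 / (total_words + vocab_size))

def log_prior(class_share: int, total_share: int) -> float:
    """Log prior of a class that holds class_share out of total_share equal-size parts."""
    return math.log(class_share / total_share)

# (W, V, number of the 37 original categories merged into this class)
stats = {
    "agriculture": (31649, 6951, 1),
    "everything_else": (1030043, 44221, 36),
}

for name, (w, v, share) in stats.items():
    print(name, round(log_unk(w, v), 2), round(log_prior(share, 37), 2))
```

Note that the UNK log probability depends only on W + V, so a class with little training data automatically assigns a higher probability to every unseen word.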
Then I have a text not related to agriculture that consists mostly of unknown words (UNK). It has 270 words, most of them unknown to both "agriculture" and "everything_else". Let's assume all 270 words are UNK for "agriculture", while for "everything_else" 260 words are UNK and the other 10 are known.
Then I calculate the log scores (all terms are log probabilities):
P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)
P(text|everything_else) = P(everything_else) + SUM(P(UNK|everything_else) for 260 times) + SUM(P(word|everything_else) for 10 times)
In the last line, 260 words are counted as UNK and 10 as known for the category.
The main problem: since P(UNK|agriculture) >> P(UNK|everything_else) (in log terms, -10.56 versus -13.89), the 270 terms P(UNK|agriculture) outweigh the whole sum for "everything_else", including the P(word|everything_else) terms for the known words. Because
SUM(P(UNK|agriculture) for 270 times) = -2851.2
SUM(P(UNK|everything_else) for 260 times) = -3611.4
the first sum is much larger, and neither P(agriculture) nor SUM(P(word|everything_else) for 10 words) can compensate, because the difference is huge. So the text always falls into the "agriculture" category even though it does not belong to it.
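To make the size of the gap concrete, here is a minimal sketch of the arithmetic using only the rounded values quoted above. The variable names are illustrative; the 10 known words are left out of the "everything_else" score because their log probabilities are not given here.

```python
# Rounded log values quoted in the question
LOG_PRIOR_AGRI = -3.61
LOG_PRIOR_ELSE = -0.03
LOG_UNK_AGRI = -10.56
LOG_UNK_ELSE = -13.89

# Full score for "agriculture": all 270 words are UNK
score_agri = LOG_PRIOR_AGRI + 270 * LOG_UNK_AGRI

# Partial score for "everything_else": prior plus the 260 UNK words only
partial_else = LOG_PRIOR_ELSE + 260 * LOG_UNK_ELSE

# What the 10 known words would have to add for "everything_else" to win
required = score_agri - partial_else

print(round(score_agri, 2), round(partial_else, 2), round(required, 2))
```

The required contribution is positive, but a log probability is never greater than 0, so the 10 known words can only lower the "everything_else" score further. With these numbers "agriculture" wins regardless of what the known words are.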
The question is: am I missing something? If not, how should I deal with a large number of UNK words whose probability is significantly higher for small categories?
UPD: I tried enlarging the training data for the "agriculture" category (just concatenating the documents 36 times) so that both classes have an equal number of documents. It helped for a few categories but not much for others. I suspect that, because of the smaller number of words and the smaller dictionary, P(UNK|specific_category) gets bigger and outweighs P(UNK|everything_else) when summed 270 times.
So it seems this method is very sensitive to the number of words in the training data and to the vocabulary size. How can I overcome this? Maybe bigrams/trigrams would help?
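The sensitivity can be illustrated with a small sketch. It assumes, as the reported values suggest, that the UNK log probability is log(1 / (W + V)) with a per-class vocabulary, and that concatenating the documents multiplies W while leaving V unchanged. The helper `log_unk` is hypothetical, and this is not my actual code.

```python
import math

def log_unk(total_words: int, vocab_size: int) -> float:
    """Assumed add-one smoothing for an unseen word: log(1 / (W + V))."""
    return -math.log(total_words + vocab_size)

W_AGRI, V_AGRI = 31649, 6951
W_ELSE, V_ELSE = 1030043, 44221

# Duplicating the training documents scales the word count but not the vocabulary
for copies in (1, 2, 6, 36):
    print(copies, round(log_unk(W_AGRI * copies, V_AGRI), 2))

print("everything_else", round(log_unk(W_ELSE, V_ELSE), 2))
```

Under these assumptions the UNK log probability of the small class drops as its word count grows, and with 36 copies it ends up slightly below the "everything_else" value. The result still depends on each category's own W and V, which may be why it worked for some categories and not for others.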