I am trying to classify a sample using Naive Bayes. My sample has 2.8 million records; 90% of the records have the class label (dependent variable) = "0" and the rest have "1". The distribution in the test set is the same (90%/10%). The Naive Bayes classifier labels the entire test set as "0". How do I deal with this case? Are there other algorithms that can be applied in such cases?


- In practice the independent feature assumption does not usually hold, so NB is a bit hit and miss depending on the data. I've recently completed a project using [random forests](http://en.wikipedia.org/wiki/Random_forest), which performed significantly better than NB. – Tim Nov 19 '13 at 10:41
- This question could also be asked at http://stats.stackexchange.com/ – alko Nov 27 '13 at 08:31
3 Answers
Your problem may or may not be solved by a better classifier. The issue is that your data set is imbalanced. If the classes are not separable, then 90% accuracy may represent good performance, and the classifier achieves it by always predicting the majority class. If this is not the behaviour you want, you should use a cost function that penalizes errors on the minority class more heavily, or resample your positives so that the classes are more evenly balanced.
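As a minimal sketch of the resampling idea, here is random oversampling of the minority class with NumPy on hypothetical toy data (the array shapes and the 90/10 split are illustrative, not the asker's actual records):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 90% class 0, 10% class 1 (stand-in for the real records)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)

# Randomly duplicate positives until both classes have the same count
pos_idx = np.where(y == 1)[0]
extra = rng.choice(pos_idx, size=(y == 0).sum() - len(pos_idx), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print((y_bal == 0).sum(), (y_bal == 1).sum())  # now balanced
```

Training on the balanced set gives the classifier no incentive to collapse to the all-"0" prediction; the alternative is undersampling the majority class, which discards data but keeps the training set small.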

There are dozens of classifiers, including:
- Logistic regression
- SVM
- Decision tree
- Neural network
- Random forest
- many, many more...
Most of these can handle class imbalance through some technique, for example class weighting in SVMs (available in scikit-learn).
So why does NB fail? Naive Bayes is very naive: it assumes each feature is independent of the others, which is rarely the case. It is a simple model that is easy to understand, but a weak classifier in general.

Almost all classification methods don't actually return a binary result, but a propensity score (usually between 0 and 1) indicating how likely the given case is to fall within the category. Binary results are then created by picking a cut-off point, usually 0.5.
When you want to identify rare cases using weak predictors, any classification method may fail to find cases with a propensity score above 0.5, resulting in all 0s as in your case.
There are 3 things you can do in such a situation:
- I recommend finding stronger predictors if at all possible
- A different statistical method may be better at identifying patterns in your data set
- Lowering the cut-off point will increase the number of true positives at the expense of more false positives
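The third option above can be sketched in scikit-learn by reading the propensity scores directly and applying a custom cut-off; logistic regression and the synthetic data are illustrative choices, not the asker's setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 90/10 imbalanced data (illustrative only)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns the propensity scores; column 1 is P(class = 1)
scores = clf.predict_proba(X)[:, 1]

# Default cut-off of 0.5 vs. a lowered cut-off of 0.2
for cutoff in (0.5, 0.2):
    pred = (scores >= cutoff).astype(int)
    print(cutoff, (pred == 1).sum())
```

Lowering the cut-off always predicts at least as many positives, so the trade-off between true and false positives can be tuned by inspecting precision and recall at several thresholds.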
