3

I just created my own Naive Bayes model from scratch and trained it on 776 documents. I tried classifying three test documents, and it classified all three of them wrong. For two of the three, the correct category actually came out with the lowest probability of all the categories.

Should I increase the number of training documents? I don't think it's my code, because I checked the computation, but I don't know, maybe the compute_numerators function is wrong somehow? For the numerator I used logs because of the underflow problem and summed the log probabilities of the terms plus the log of (number_of_documents_in_category / overall_number_of_documents): https://i.stack.imgur.com/GIwIp.png
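In case the screenshot doesn't load, here is roughly what that computation looks like as a simplified Python sketch (not my exact code; names like docs_per_category and term_counts are just placeholders):

```
import math

def compute_numerators(doc_tokens, categories, docs_per_category, total_docs,
                       term_counts, vocab_size):
    """Log-space Naive Bayes numerator for each category (simplified sketch).

    docs_per_category: category -> number of training documents in it
    term_counts:       category -> {term: frequency} over that category's docs
    vocab_size:        number of distinct terms in the whole training set
    """
    numerators = {}
    for cat in categories:
        # log prior: number_of_documents_in_category / overall_number_of_documents
        log_prob = math.log(docs_per_category[cat] / total_docs)
        total_terms = sum(term_counts[cat].values())
        for term in doc_tokens:
            # add-one (Laplace) smoothing so an unseen term doesn't give log(0)
            p_term = (term_counts[cat].get(term, 0) + 1) / (total_terms + vocab_size)
            log_prob += math.log(p_term)
        numerators[cat] = log_prob
    return numerators
```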

Super confused and discouraged since this took me so long and now I feel like it was for nothing because it didn't even classify ONE document correctly :(

@Bob Dillon, Hi, thank you for your thorough reply. My biggest question is what you mean by separable. Do you mean whether there is a clear distinction between the documents in the different classes? I don't really know how to answer that. The data was classified by humans, so the separation is possible, but maybe some categories are so close to others that they get blurred? Maybe the computer doesn't recognize a difference between the words used in one category vs. another? I have to keep those categories; I cannot rearrange them, they must stay as they are. I am not sure how to prototype in R - wouldn't I still need to take in the text data and run it? Wouldn't I still need to do the tokenization etc.? I'm going to look into information gain and SVM, and I will probably post back. Thanks!

hope288
  • 725
  • 12
  • 23
  • Well what are you comparing it against to know that it is wrong? Plus the magic number in stats is 30. As long as you have more than 30 observations your sample size should be large enough. – FirebladeDan Aug 05 '15 at 21:19
  • Well the test document is already classified, I just want to see if it's been classified correctly. And I compare the predicted with the actual. Yea I have 776 so definitely covering the 30 base, but maybe text classification requires much more?? – hope288 Aug 05 '15 at 21:31
  • So don't worry about your sample, you're good. I still don't understand what engine you're comparing against. Is your trained set different from one made in R or Matlab? I'm still confused how you know it's wrong. – FirebladeDan Aug 05 '15 at 21:39
  • @FirebladeDan citation for "magic number in stats is 30"? That doesn't sound right at all. – IVlad Aug 05 '15 at 21:53
  • @IVlad - I will take that citation seeing as your credentials are superior. Good input dlow – FirebladeDan Aug 05 '15 at 22:01
  • Yes, that's what I mean by separable. If the data is human-classified, then I would focus on dimensionality reduction. Humans are great at disregarding useless information and focusing on valuable information. ML gets confused by useless information (noise, like the word "the"), so feature selection helps a lot here. – Bob Dillon Aug 07 '15 at 11:10
  • Here's the tool that I have tried in R: http://www.rtexttools.com/ – Bob Dillon Aug 07 '15 at 11:11
  • They're easy to get started with: you import text documents and the tools go from there. But I found it tricky to get good results from them. This tool is very powerful and lets you experiment with a variety of ML techniques to see which gets the best results on your data. – Bob Dillon Aug 07 '15 at 11:25
  • A good intro to the R tools: http://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf – Bob Dillon Aug 07 '15 at 11:25
  • This question belongs on Cross Validated. – clickbait Jul 10 '18 at 02:05

2 Answers

3

I just created my own Naive Bayes model from scratch and trained it on 776 documents

Naive Bayes, like its name says, is a naive algorithm. It's very weak compared to modern methods, like support vector machines or (deep) neural networks. You should keep this in mind when using it: expect better results than tossing a coin would give you, but not by very much.

I tried classifying three test documents, and it classified all three of them wrong

Only three test documents? That is far too few and tells you nothing. If you have x documents in total, you should set aside at least 20% of them for testing. Also consider using cross-validation.
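As a rough sketch of what that evaluation could look like (Python with scikit-learn; docs and labels are hypothetical lists holding your documents and their human-assigned categories, and load_corpus is a placeholder for however you read them in):

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs, labels = load_corpus()  # placeholder: your 776 texts and their categories

# Hold out 20% of the documents for testing instead of just 3
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=0)

model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation uses every document for both training and testing
scores = cross_val_score(model, docs, labels, cv=5)
print("cross-validated accuracy:", scores.mean())
```

Comparing your own implementation's accuracy on the same held-out split against a library baseline like this also tells you whether the problem is the data or a bug in your code.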

Should I increase the number of training documents?

This will help, yes. A good rule of thumb in machine learning is that more data will usually beat a better algorithm. Of course, we can't always get more data, or can't afford the processing power to use it, so better algorithms are still important.

To be able to see an improvement though, you'll need to use more testing data as well.

In conclusion: test on more data. If you have 779 documents, use at least 100 for testing, or do cross-validation. If you get above 50-60% accuracy, be happy; that's good enough for this amount of data and Naive Bayes.

IVlad
  • 43,099
  • 13
  • 111
  • 179
  • @IVlad, thank you for your reply. I increased the testing set to about 400 documents, but only 3 were classified correctly :( I am increasing the training set to 6000 documents and going to use 1200 for the testing set. I am also looking into SVM; I will update soon. Thank you again! – hope288 Aug 06 '15 at 04:56
3

You have a lot working against you.

  1. Weak dimensionality reduction - stop word filtering only
  2. Multi-class classification
  3. Weak classifier
  4. Little training data

You've shown us the code that you're using, but if the data is not separable, then no algorithm will sort it out. Are you sure that the data can be classified? If so, what performance do you expect?

You should try prototyping your system before jumping into implementation. Octave, R, or MATLAB is a good place to start. Make sure your data is separable and the algorithm is effective on your data. Others have suggested using SVMs and neural nets rather than Naive Bayes classification; that's a good suggestion, though each takes a bit of tweaking to get the best performance. I've used the Google Prediction API as a first-order check of the performance I can expect from a system, and then replaced it with an SVM or another classifier to optimize performance and reduce cost/latency/etc. It's good to get a baseline as quickly and easily as possible before diving too deep.
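If you'd rather prototype in Python than Octave/R/MATLAB, a quick baseline comparison could look something like this (a rough sketch using scikit-learn; docs and labels are placeholders for your texts and categories):

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# docs and labels are assumed to be parallel lists (placeholders)
for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipeline, docs, labels, cv=5)
    print(name, "mean cross-validated accuracy:", round(scores.mean(), 3))
```

If the SVM does much better than Naive Bayes on the same features, the classifier is the bottleneck; if both do poorly, look at the features and the separability of the data.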

If the data is separable, the more help you give the system, the better it will perform. Feature/dimensionality reduction removes noise and helps the classifier perform well. There are statistical analyses you can use to shrink the feature set; I like Information Gain, but there are others.
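As a rough sketch of what Information Gain-based feature selection can look like in practice (scikit-learn's mutual_info_classif is essentially information gain between a term and the class label; docs and labels are placeholders again):

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # docs is a placeholder list of texts

# Keep only the terms with the highest mutual information (information gain)
# with respect to the human-assigned category labels
selector = SelectKBest(mutual_info_classif, k=min(1000, X.shape[1]))
X_reduced = selector.fit_transform(X, labels)

kept_terms = [term for term, keep in zip(vectorizer.get_feature_names_out(),
                                         selector.get_support()) if keep]
print(len(kept_terms), "terms kept out of", X.shape[1])
```

Training on X_reduced instead of the full term matrix often helps a weak classifier like Naive Bayes, because the noisy terms that contribute nothing to the class decision are gone.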

I found this paper to be a good theoretical treatment of text classification, including feature reduction.

I've been successful using Information Gain for feature reduction and found this paper to be a very good practical guide.

As for the amount of training data, that is not so clear-cut. More is typically better, but the quality of the data is very important too. If the data is not easily separable, or the underlying probability distribution of the training set is not similar to that of your test and real-world data, then performance will be poor even with more data. Put another way, the quantity of training data matters, but quality matters at least as much.

Good luck!

Bob Dillon
  • 341
  • 1
  • 7