I just created my own Naive Bayes model from scratch and trained it on 776 documents. When I tried classifying documents, it got all three of my test documents wrong. For two of the three, the category it should have picked actually had the lowest probability of all the categories.
Should I increase the number of training documents?
I don't think it's my code because I checked the computation, but I don't know, maybe the compute_numerators function is wrong somehow? For the numerator I used logs because of the underflow problem: I summed the log probabilities of the terms plus the log of the prior (number_of_documents_in_category / overall_number_of_documents).
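For reference, here's a minimal sketch of what I understand compute_numerators should be doing (names and data layout are simplified for the example, and I've assumed add-one smoothing so unseen terms don't produce log(0), which may differ from my actual code):

```python
from math import log
from collections import Counter

def compute_numerators(doc_tokens, docs_by_category, vocab_size, total_docs):
    """Compute the log-space Naive Bayes numerator for each category.

    doc_tokens: list of tokens in the document to classify
    docs_by_category: {category: list of token lists (training docs)}
    Returns {category: log P(category) + sum of log P(term | category)}.
    """
    numerators = {}
    for category, docs in docs_by_category.items():
        # log prior: number_of_documents_in_category / overall_number_of_documents
        log_prior = log(len(docs) / total_docs)
        # term counts over all training docs in this category
        term_counts = Counter(t for doc in docs for t in doc)
        total_terms = sum(term_counts.values())
        # add-one (Laplace) smoothing avoids log(0) for unseen terms
        log_likelihood = sum(
            log((term_counts[t] + 1) / (total_terms + vocab_size))
            for t in doc_tokens
        )
        numerators[category] = log_prior + log_likelihood
    return numerators

# tiny usage example: pick the category with the largest log numerator
training = {
    "spam": [["buy", "now"], ["buy", "cheap"]],
    "ham": [["hello", "friend"], ["see", "you", "friend"]],
}
scores = compute_numerators(["buy", "cheap"], training, vocab_size=7, total_docs=4)
prediction = max(scores, key=scores.get)
```

The key point is that everything stays in log space, so the per-term probabilities are summed (not multiplied), and the prediction is simply the category with the largest (least negative) numerator.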
Super confused and discouraged since this took me so long and now I feel like it was for nothing because it didn't even classify ONE document correctly :(
@Bob Dillon, Hi, thank you for your thorough reply. My biggest question is what you mean by separable. Do you mean whether there is a clear distinction between the documents in different classes? I don't really know how to answer that. The data was classified by humans, so the separation is possible, but maybe some categories are so close to others that the boundaries get blurred? Maybe the computer doesn't recognize a difference between the words used in one category vs. another? I have to keep the categories as they are; I cannot rearrange them. I'm also not sure how to prototype in R: wouldn't I still need to read in the text data and run it? Wouldn't I still need to do the tokenization etc.? I'm going to look into information gain and SVM, and I will probably post back. Thanks!