
Well, I wrote this code to classify my data. The data consists of 5000 instances and 260 features. Each feature is binary (Bernoulli): for example, if the word "money" appears in the instance being categorized, then feature 23 is 1, otherwise 0. There are 4 categories. When I compute the final classes, I get a 57% error rate. In most cases, the predicted probability P(y=c|x) is 0 for all c. Even in the correct cases, the maximum of this value is tiny, e.g. P(y=1, x) = e^-80, but the others are even smaller, so class 1 is selected, which is correct. So I guess the problem is that the values are too small. How can I solve this? I've seen that working with logarithmic probabilities may be better, but how can I implement this logarithmically? Thanks in advance.

I am putting the code as an appendix in case anything is missing or wrong in it. labels = the data classes; normalized features = the data, where rows are the instances and columns are the features. Thanks again.


code_part1, code_part2

Alp
  • http://stats.stackexchange.com/questions/31891/why-does-my-naive-bayes-classifier-only-give-me-probabilities-near-0 – Brendan Frick Dec 07 '16 at 20:04
  • Use `log` probabilities instead. Also, it would be nice if you attached the actual code to your post rather than images. That kind of behaviour is frowned upon here as no one can run your code to reconstruct your problem unless they do some sort of OCR... and that itself is more effort than what most people are willing to put in. – rayryeng Dec 07 '16 at 20:35
  • Thank you very much. Actually, I posted the photos because I found putting the code here harder, with indenting every line etc., and the code is not coloured; I thought it would be easier for others to read. Anyway, thank you again. Taking the log at just the final stage gave a much better result. There was also a small mistake at the beginning when counting the instances of each class. So now it reaches around 92% accuracy (on just the training data for now). Have a nice day. – Alp Dec 07 '16 at 21:06
  • @Alp Could you write an answer that provides what you did in detail? It may help with others who have the same problem. – rayryeng Dec 07 '16 at 22:03
  • Rather than multiplying the likelihoods, which was why the numbers became too small, I implemented the log posterior, i.e. pxisc(1) = prod ( sum( log(p_c1) + log(px_c1.^(x_test)) + log((1-px_c1).^(1-x_test))) );. To calculate the count of instances in each class, I changed count_label1 = sum(labels(labels==1)) to count_label1 = sum(labels==1), since the former computes the sum of the label values, not the count. Other than that, the number of instances in each class differs: one class has 1350 instances while another has 1086, for example. For better accuracy, I took 1000 of each. – Alp Dec 08 '16 at 21:36
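
The fix described in the comments above — summing log-likelihoods instead of multiplying likelihoods, and counting class instances with `labels == c` rather than summing label values — can be sketched as follows. This is a Python/NumPy translation of the approach, not the original MATLAB code; all function and variable names here (`fit_bernoulli_nb`, `predict`, `X`, `y`, etc.) are illustrative.

```python
import numpy as np

def fit_bernoulli_nb(X, y, n_classes, alpha=1.0):
    """Estimate log priors and per-class log feature probabilities
    for binary features, with Laplace smoothing (alpha)."""
    n, d = X.shape
    log_prior = np.empty(n_classes)
    log_p = np.empty((n_classes, d))   # log P(x_j = 1 | y = c)
    log_q = np.empty((n_classes, d))   # log P(x_j = 0 | y = c)
    for c in range(n_classes):
        Xc = X[y == c]
        count_c = Xc.shape[0]          # count of class c, i.e. sum(labels == c)
        log_prior[c] = np.log(count_c / n)
        p = (Xc.sum(axis=0) + alpha) / (count_c + 2 * alpha)
        log_p[c] = np.log(p)
        log_q[c] = np.log1p(-p)        # log(1 - p), numerically safer
    return log_prior, log_p, log_q

def predict(X, log_prior, log_p, log_q):
    # Summing logs keeps quantities like e^-80 representable,
    # where a product of 260 likelihoods would underflow to 0.
    scores = log_prior + X @ log_p.T + (1 - X) @ log_q.T
    return scores.argmax(axis=1)

# Tiny synthetic check: two classes with clearly different feature rates.
rng = np.random.default_rng(0)
y = (np.arange(400) >= 200).astype(int)
rates = np.where(y[:, None] == 0, 0.8, 0.2)
X = (rng.random((400, 20)) < rates).astype(float)
pred = predict(X, *fit_bernoulli_nb(X, y, 2))
print((pred == y).mean())  # training accuracy; should be high
```

The key design point is that `argmax` over the log posterior picks the same class as `argmax` over the posterior itself, since log is monotonic, so no probabilities ever need to be exponentiated back.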

0 Answers