I've written a fairly efficient multinomial Naive Bayes classifier and it works like a charm. That is, until the classifier encounters very long messages (on the order of 10k words), where the prediction results are nonsensical (e.g. -0.0) and I get a math domain error when using Python's math.log function. The reason I'm using log is that multiplying many very small floats yields even smaller floats, and the log helps avoid numbers so small that they underflow and break the predictions.
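For illustration, here is a minimal sketch of the kind of underflow described above. The per-word probabilities are made-up values and the helper names `naive_product` and `log_sum` are hypothetical, not taken from my classifier. It compares multiplying the probabilities directly with summing their logs:

```python
import math

def naive_product(probs):
    """Multiply probabilities directly (prone to underflow)."""
    product = 1.0
    for p in probs:
        product *= p
    return product

def log_sum(probs):
    """Sum the logs of the individual probabilities instead."""
    return sum(math.log(p) for p in probs)

# Hypothetical likelihoods for a 10,000-word message (made-up values).
word_probs = [1e-4] * 10_000

product = naive_product(word_probs)   # too small for a double, collapses to 0.0
log_score = log_sum(word_probs)       # stays a finite negative number

try:
    math.log(product)                 # log of 0.0 is undefined
    log_of_product_failed = False
except ValueError:
    log_of_product_failed = True
```

In this sketch, taking the log of the already multiplied product fails with a ValueError, while taking the log of each probability and adding the results does not.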
Some context
I'm using the bag-of-words approach with raw frequency counts and no term weighting such as TF-IDF (I can't figure out how to implement it properly and how to handle words with zero occurrences; a snippet for that would be appreciated too ;) ). For smoothing I'm using Laplace (add-one) smoothing, which basically adds 1 to each frequency count so it's never 0.
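To make the setup concrete, this is a minimal sketch of frequency counting with Laplace smoothing for a single class. The function names, the `alpha` parameter, and the toy vocabulary are assumptions for illustration only, and skipping out-of-vocabulary words is one possible choice, not necessarily what my code does:

```python
import math
from collections import Counter

def train_word_log_probs(documents, vocabulary, alpha=1.0):
    """Return log P(word | class) with additive (Laplace) smoothing.

    documents:  list of token lists belonging to one class
    vocabulary: set of all known words across classes
    alpha:      smoothing constant (1.0 gives add-one smoothing)
    """
    counts = Counter()
    for doc in documents:
        counts.update(w for w in doc if w in vocabulary)
    # Adding alpha once per vocabulary word keeps the distribution normalized.
    denom = sum(counts.values()) + alpha * len(vocabulary)
    return {w: math.log((counts[w] + alpha) / denom) for w in vocabulary}

def score(message, log_prior, word_log_probs):
    """Log-score of a tokenized message; unknown words are skipped."""
    return log_prior + sum(
        word_log_probs[w] for w in message if w in word_log_probs
    )

# Toy data (made up for this example).
vocabulary = {"cheap", "pills", "hello"}
spam_docs = [["cheap", "pills", "cheap"]]
spam_log_probs = train_word_log_probs(spam_docs, vocabulary)

# A very long message still gets a finite score because logs are summed.
long_message = ["cheap"] * 10_000
long_score = score(long_message, math.log(0.5), spam_log_probs)
```

Here "hello" never occurs in the training documents but still gets a non-zero probability, so its log is defined.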
I could just drop the log, but then the product of probabilities would become too small to represent for such long messages and the classifier would still fail to classify them properly, so that's not a solution.