
I've written a fairly efficient implementation of a multinomial Naive Bayes classifier and it works like a charm. That is, until the classifier encounters very long messages (on the order of 10k words), where the prediction results are nonsensical (e.g. -0.0) and I get a math domain error from Python's math.log function. The reason I'm using log is that when working with very small floats, multiplying them produces even smaller floats, and the log helps avoid vanishingly small numbers that would break the predictions.
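For illustration, a minimal sketch of the failure mode with made-up probabilities (not my actual classifier code):

import math

# 10,000 smallish word probabilities, as in a very long message
probs = [1e-4] * 10_000

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 long before the loop finishes

print(product)            # 0.0
print(math.log(product))  # ValueError: math domain error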

Some context

I'm using the bag-of-words approach without any sort of vectorization (like TF-IDF, because I can't figure out how to implement it properly and handle words that occur zero times; a snippet for that would be appreciated too ;) ), and I'm using frequency counts with Laplace additive smoothing (basically adding 1 to each frequency count so it's never 0).
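Roughly, my counting with add-one smoothing looks like this (a simplified sketch with made-up data and names, not my actual code):

from collections import Counter

# Toy training data: tokenized messages grouped by class
docs_by_class = {
    "spam": [["free", "money", "now"], ["free", "offer"]],
    "ham": [["meeting", "at", "noon"]],
}

vocabulary = {word for docs in docs_by_class.values() for doc in docs for word in doc}

word_probs = {}
for label, docs in docs_by_class.items():
    counts = Counter(word for doc in docs for word in doc)
    # Laplace smoothing: every vocabulary word gets count + 1, so no probability is ever 0;
    # the denominator grows by the vocabulary size so the probabilities still sum to 1.
    total = sum(counts.values()) + len(vocabulary)
    word_probs[label] = {word: (counts[word] + 1) / total for word in vocabulary}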

I could just take the log out, but then the engine would still fail to classify such long messages properly, so that's not the point.

Nocturn9X

1 Answer


There is no multiplication in Naive Bayes if you apply the log-sum-exp trick, only additions, so underflow is unlikely. And if you use smoothing (as you say you do), you should never get an undefined result from log.
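As a rough illustration with toy numbers (not your code): a class score is just the log prior plus a sum of count-weighted log likelihoods, so even a 10k-word message only ever adds finite numbers:

import math

# Hypothetical smoothed likelihoods P(word | class) and a class prior
log_prior = math.log(0.5)
log_likelihood = {"free": math.log(0.01), "money": math.log(0.005), "now": math.log(0.02)}

# Word counts from a (possibly very long) message
message_counts = {"free": 4000, "money": 3000, "now": 3000}

# Proportional to log P(class | message): no product of probabilities anywhere
score = log_prior + sum(count * log_likelihood[word] for word, count in message_counts.items())
print(score)  # a large negative, but perfectly representable, float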

This stats.stackexchange answer describes the underlying math. For a reference implementation, here's a snippet of mine for MultinomialNaiveBayes (analogous to sklearn's sklearn.naive_bayes.MultinomialNB, with a similar API):

import numpy as np
import scipy.sparse


class MultinomialNaiveBayes:
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Calculate priors from data
        self.log_priors_ = np.log(np.bincount(y) / y.shape[0])

        # Get indices where data belongs to separate class, creating a slicing mask.
        class_indices = np.array(
            np.ma.make_mask([y == current for current in range(len(self.log_priors_))])
        )
        # Split the dataset by class using the boolean masks
        class_datasets = [X[indices] for indices in class_indices]

        # Sum per-feature (word) counts within each class and add alpha smoothing.
        # Reshape into [n_classes, n_features]
        classes_metrics = (
            np.array([dataset.sum(axis=0) for dataset in class_datasets]).reshape(
                len(self.log_priors_), -1
            )
            + self.alpha
        )

        # Calculate log likelihoods
        self.log_likelihoods_ = np.log(
            classes_metrics / classes_metrics.sum(axis=1)[:, np.newaxis]
        )

        return self

    def predict(self, X):
        # Return most likely class
        return np.argmax(
            scipy.sparse.csr_matrix.dot(X, self.log_likelihoods_.T) + self.log_priors_,
            axis=1,
        )
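
For example, with some toy count data it could be used like this (note that predict expects a scipy sparse matrix because of the csr_matrix.dot call):

import numpy as np
import scipy.sparse

# Toy word-count matrix: 4 documents x 3 vocabulary words, two classes
X = scipy.sparse.csr_matrix(np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 3, 2],
]))
y = np.array([0, 0, 1, 1])

model = MultinomialNaiveBayes(alpha=1.0).fit(X, y)
X_new = scipy.sparse.csr_matrix(np.array([[1, 0, 0], [0, 2, 1]]))
print(model.predict(X_new))  # predicted classes for the two test documents (0 and 1 here)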

BTW, -0.0 is exactly the same value as 0.0 (they compare equal), so it's a perfectly sensible result.

Szymon Maszke
  • From what I've been able to understand, it looks like my floats get too small (below `sys.float_info.min`), which causes an underflow. I'll investigate anyway. Oh, and as a side note, I don't use numpy (because I really don't know how to use it); could you show me what you mean by _"There is no multiplication in Naive Bayes if you use log-sum-exp"_? – Nocturn9X May 19 '20 at 12:11
  • If you add floats together they won't overflow, and 10k words is nowhere near the amount that would. Please note that Naive Bayes doesn't use __any__ multiplication (only one division, where you compute the class prior probabilities, which won't underflow either) – Szymon Maszke May 19 '20 at 12:13
  • Did you multiply log transformed probabilities or did you add them? – Szymon Maszke May 19 '20 at 12:14
  • Naive Bayes does multiply probabilities from what I've studied. You take each word's probability and multiply it with all other probabilities in your sentence and then multiply that by the prior probability, isn't that it? – Nocturn9X May 19 '20 at 12:16
  • Yes, but it makes your implementation underflow. If you multiply `10,000` small values by each other they will become smaller than machine precision. That's why you take the `log` of each probability and sum them together. Since you take the `argmax` and this transformation doesn't change where the maximum is, it works. See [the answer I've linked to](https://stats.stackexchange.com/questions/105602/example-of-how-the-log-sum-exp-trick-works-in-naive-bayes/253319#253319) – Szymon Maszke May 19 '20 at 12:20
  • So I should take `sum(log(p) for p in probs)` rather than `log(prod(p for p in probs))`, right? **EDIT**: I edited the offending line in my implementation and it keeps all the test metrics (accuracy, F-score, precision and recall) consistent with the multiply-then-log version. Will now test with the bigger dataset that caused the issue – Nocturn9X May 19 '20 at 12:23
  • @Nocturn9X yes, exactly – Szymon Maszke May 19 '20 at 12:38
  • Yeah, now it works, with surprisingly high accuracy! Never seen such a good confusion matrix, tbh. Thanks a lot :) – Nocturn9X May 19 '20 at 12:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/214167/discussion-between-nocturn9x-and-szymon-maszke). – Nocturn9X May 19 '20 at 12:42