Please see below. If you want more details of the mathematics involved, you might be better off posting on Cross Validated.
Could someone outline why the log-sum-exp trick is/needs to be done?
This is for numerical stability. If you search for "logsumexp" you will find several useful explanations, e.g. https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp and the question "log-sum-exp trick why not recursive". Essentially, the procedure avoids the numerical errors (overflow and underflow) that occur when exponentiating numbers that are very large or very small.
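For concreteness, here is a minimal sketch of the trick in Python/NumPy (the function name and the test values are my own, purely for illustration):

```python
import numpy as np

def log_sum_exp(v):
    """Compute log(sum(exp(v))) in a numerically stable way.

    Subtracting the maximum first keeps the exponentiated values in a
    range where they neither overflow nor all underflow to zero.
    """
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

v = np.array([-1000.0, -1001.0])
print(np.log(np.sum(np.exp(v))))  # -inf: exp(-1000) underflows to 0
print(log_sum_exp(v))             # about -999.69, the correct value
```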
specifically what the argument Li,: reads as
The i means "take the ith row", and the : means "take all values from that row". So, overall, Li,: means the ith row of L. The colon : is used in Matlab (and its open source derivative Octave) to mean "all indices" when subscripting vectors or matrices.
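NumPy happens to use the same slicing convention, so if it helps, here is a tiny illustration (the matrix values are made up):

```python
import numpy as np

L = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])

# L[i, :] selects the ith row and all columns -- the same meaning as
# Li,: in the book (NumPy indexing is 0-based, Matlab/Octave is 1-based).
print(L[1, :])  # [0.3 0.4]
```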
could someone give me a good notion of what the two values in line 8 of algorithm 3.1 are for?

One of them is the frequency with which class c appears in the training examples. Adding a hat indicates that this frequency is to be used as an estimate of the probability of class c appearing in the population as a whole; in terms of Naive Bayes, these probabilities are the priors.
And similarly, the other value is an estimate of the probability of the jth feature appearing when you restrict your attention to class c. These are the conditional probabilities, P(j|c) = probability of seeing feature j given class c, and the "naive" in Naive Bayes means that we assume the features are independent given the class.
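If it helps, here is a rough Python/NumPy sketch of how these two kinds of estimates could be computed for binary features. The function and variable names (X, y, pi, theta) are my own, not the book's, and I have left out any pseudocount smoothing the book may apply:

```python
import numpy as np

def fit_naive_bayes(X, y, n_classes=2):
    """Estimate class priors and per-class feature probabilities.

    X: (N, D) matrix of binary features, one row per training example.
    y: length-N vector of class labels in {0, ..., n_classes - 1}.
    Assumes each class occurs at least once in the training set.
    """
    N, D = X.shape
    pi = np.zeros(n_classes)          # pi[c]: estimate of P(class = c), the prior
    theta = np.zeros((n_classes, D))  # theta[c, j]: estimate of P(feature j = 1 | class = c)
    for c in range(n_classes):
        Xc = X[y == c]                # training examples belonging to class c
        pi[c] = Xc.shape[0] / N       # frequency of class c in the training set
        theta[c] = Xc.mean(axis=0)    # frequency of each feature within class c
    return pi, theta
```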
Note: the quotes from your question have been modified a little for clarity / convenience of exposition.
Edit in reply to your comment
- Li,: is a vector. N is the number of training examples and D is the dimension of the data, i.e. the number of features (each feature is a column in the matrix x, whose rows are training examples).
- What is Li,:? Each entry Li,c looks like the log of: the prior for class c times the product, over the features of example i, of the conditional probabilities of seeing those features given class c. Note that there are only two entries in the vector Li,:, one for each class (it's binary classification, so there are just two classes).
Using Bayes' theorem, the entries of Li,: can be interpreted as the logs of the relative conditional probabilities of training example i being in class c, given the features of i (strictly speaking they are not probabilities, because each would need to be divided by the same normalizing constant, but since that constant is common to both classes we can safely ignore it).
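Continuing the sketch above (again, the names are mine, and this assumes binary features with the Bernoulli model), each entry of Li,: could be computed roughly like this:

```python
import numpy as np

def log_joint(x, pi, theta):
    """Return the vector Li,: for a single example x with binary features.

    Entry c is log(pi[c]) plus the sum over features j of
    log P(x[j] | class c), using P(x_j = 1 | c) = theta[c, j].
    Assumes every theta[c, j] is strictly between 0 and 1.
    """
    log_prior = np.log(pi)
    log_lik = x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T
    return log_prior + log_lik
```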
I'm not sure about line 6 of algorithm 3.2. If all you need to do is figure out which class your training example belongs to, then to me it seems sufficient to omit line 6 and, for line 7, use argmax_c Lic. Perhaps the author included line 6 because pic has a particular interpretation?
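For what it's worth, here is a small illustration of that last point (the numbers are made up): normalizing Li,: with the log-sum-exp trick produces values pic that sum to one, and so can be read as posterior probabilities, while leaving the argmax unchanged.

```python
import numpy as np

L_i = np.array([-12.3, -15.8])   # a made-up Li,: vector for one example

# Line-6-style normalization: subtract logsumexp(Li,:) and exponentiate.
m = L_i.max()
p_i = np.exp(L_i - (m + np.log(np.exp(L_i - m).sum())))

print(p_i, p_i.sum())                    # posterior-like values summing to 1
print(np.argmax(p_i) == np.argmax(L_i))  # True: the argmax is unchanged
```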