
I am using Naive Bayes in text classification.

Assume that my vocabulary is ["apple","boy","cup"] and the class label is "spam" or "ham". Each document will be converted to a 3-dimensional 0-1 vector. For example, "apple boy apple apple" will be converted to [1,1,0].
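A minimal sketch of this conversion in Python (the helper name `doc_to_vector` is mine, not from any library):

    # Set-of-words conversion: 1 if the vocabulary word occurs in the document, else 0.
    vocab = ["apple", "boy", "cup"]

    def doc_to_vector(doc, vocab=vocab):
        words = set(doc.split())
        return [1 if w in words else 0 for w in vocab]

    print(doc_to_vector("apple boy apple apple"))  # [1, 1, 0]
    print(doc_to_vector("apple boy"))              # [1, 1, 0]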

Now I have calculated the conditional probabilities p("apple"|"spam"), p("apple"|"ham"), p("boy"|"spam"), etc. from the training examples.
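For reference, a rough sketch of how these could be estimated from 0-1 training vectors, assuming Laplace smoothing for the binary (Bernoulli) model; the names `estimate_cond_probs`, `train_vectors`, and `labels` are illustrative:

    # p(word present | class) estimated as a smoothed document frequency per class.
    def estimate_cond_probs(train_vectors, labels, vocab, alpha=1.0):
        probs = {}
        for c in set(labels):
            docs = [v for v, y in zip(train_vectors, labels) if y == c]
            n = len(docs)
            probs[c] = [(sum(v[i] for v in docs) + alpha) / (n + 2 * alpha)
                        for i in range(len(vocab))]
        return probs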

To test whether a document is spam or ham, e.g. "apple boy" -> [1,1,0], we need to compute p(features | classLabel).

Using conditional independence, for the test vector [1,1,0]:

I know of two formulas:

(1) p(features|"ham") = p("apple"|"ham")p("boy"|"ham")

(2) p(features|"ham") = p("apple"|"ham")p("boy"|"ham")(1-p("cup"|"ham"))

Which formula is right?

I believe that (2) is right because we have 3 features (actually 3 words in the vocabulary), but I see code written by others that uses (1). Although the term 1-p("cup"|"ham") is nearly 1, so it won't make much difference, I want the exact answer.
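To illustrate how close the two formulas are numerically, a quick sketch with made-up probabilities (the numbers are purely illustrative):

    # Made-up estimates, just to compare formulas (1) and (2) for p(features|"ham").
    p_apple_ham, p_boy_ham, p_cup_ham = 0.4, 0.3, 0.05

    f1 = p_apple_ham * p_boy_ham                    # formula (1): ignores the absent word
    f2 = p_apple_ham * p_boy_ham * (1 - p_cup_ham)  # formula (2): includes 1 - p("cup"|"ham")

    print(f1, f2)  # 0.12 vs 0.114 -- close, but not identical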

  • This problem came up while I was reading [Machine Learning in Action](https://www.manning.com/books/machine-learning-in-action), which covers machine learning code in Python. I think the author may not understand these two formulas very well. – Rongshen Zhang Jan 05 '16 at 02:11
  • I found a detailed discussion of this problem in Andrew Ng's Machine Learning course; read the [lecture notes](http://cs229.stanford.edu/notes/cs229-notes2.pdf) for details. Both formulas are correct, but the "features" they refer to are quite different. They come from different models. – Rongshen Zhang Jan 05 '16 at 07:21
  • Can you point me to the page in Machine Learning in Action? Would be curious. – CAFEBABE Jan 05 '16 at 09:14
  • Read pages 67-73 in MLiA. The code is right (except for the Laplace smoothing, which should use the number of words rather than 2, in my opinion), but the author didn't discuss it very well. He used the first formula but explained it as the second formula. – Rongshen Zhang Jan 05 '16 at 13:14

1 Answer


Your intuition is right, and probably also the code you wrote. However, your problem is in the notation. (I have to admit that in the beginning it is pretty tough to wrap your head around it.) The most important concept you are missing is random variables (RVs).

I use APPLE, BOY, CUP and HAM as random variables. Each of the word RVs can take one of two values: contains (c) or not contains (nc). The probability that a text contains "boy" can then be written as P(BOY=contains), and the probability that it does not contain the word is P(BOY=not contains) = 1 - P(BOY=contains).

In turn the correct formula is then

P(FEATURES|HAM) = P(CUP,BOY,APPLE|HAM) = P(CUP|HAM)P(BOY|HAM)P(APPLE|HAM)

Where the last step is due to the naive Bayes assumption. To calculate the probability you asked for, you need to compute

 P(BOY=c,APPLE=c,CUP=nc|HAM) = P(BOY=c|HAM)P(APPLE=c|HAM)P(CUP=nc|HAM) 
                             = P(BOY=c|HAM)P(APPLE=c|HAM)(1-P(CUP=c|HAM))

Actually, these are still two probabilities (which do NOT sum to one), as HAM can take two values.
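Translating that formula into a small sketch under the 0-1 (Bernoulli) model above; the per-class probabilities here are hypothetical numbers, not estimates from real data:

    # P(features | class) = product over words of p_w if the word is present, else (1 - p_w).
    def bernoulli_likelihood(vector, word_probs):
        likelihood = 1.0
        for present, p in zip(vector, word_probs):
            likelihood *= p if present else (1 - p)
        return likelihood

    # Hypothetical P(word=contains | class), in vocabulary order apple, boy, cup.
    probs = {"ham": [0.4, 0.3, 0.05], "spam": [0.1, 0.2, 0.6]}
    vector = [1, 1, 0]  # "apple boy"
    for label, word_probs in probs.items():
        print(label, bernoulli_likelihood(vector, word_probs))
    # ham:  0.4 * 0.3 * 0.95 = 0.114
    # spam: 0.1 * 0.2 * 0.4  = 0.008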

  • Nice argument! Your argument is right given that the feature is a word vector (which covers the whole vocabulary). In fact, both formulas are correct, but the "features" they refer to are different. They come from different models, although both need the naive independence assumption. Andrew Ng's Machine Learning course discusses this problem: [course lecture notes](http://cs229.stanford.edu/notes/cs229-notes2.pdf) – Rongshen Zhang Jan 05 '16 at 01:58
  • Can you point me to the section in Andrew Ng's lecture notes you are referring to? In strict mathematical notation it is almost impossible that the first formula is correct for your task. – CAFEBABE Jan 05 '16 at 09:13
  • Section 2 of [lecture-notes-2](http://cs229.stanford.edu/notes/cs229-notes2.pdf). You can just press 'CTRL+F' and search for "naive bayes". – Rongshen Zhang Jan 05 '16 at 13:06