I think I've implemented most of it correctly. One part confused me:

> The zero-frequency problem: Add 1 to the count for every attribute value-class combination (Laplace estimator) when an attribute value doesn't occur with every class value.
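As I understand it, the point of that rule is that a single unseen word would otherwise zero out the whole product of probabilities. Here's a rough sketch of my understanding (the counts and vocabulary are made up for illustration, not my actual classifier state):

```csharp
using System;
using System.Collections.Generic;

// Made-up word counts for the 'ham' class.
var hamCounts = new Dictionary<string, int> { ["macbook"] = 3, ["store"] = 2 };
int hamTotalWords = 5;
int vocabularySize = 4; // e.g. {macbook, store, free, offer} across both classes

// Without smoothing, an unseen word gives P(word|ham) = 0, which wipes out
// the entire product for the 'ham' posterior.
double unsmoothed = hamCounts.GetValueOrDefault("free") / (double)hamTotalWords;

// With the Laplace estimator, every count gets +1, so no estimate is ever zero.
double smoothed = (hamCounts.GetValueOrDefault("free") + 1)
                  / (double)(hamTotalWords + vocabularySize);

Console.WriteLine($"{unsmoothed} vs {smoothed}"); // 0 vs ~0.111
```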

Here's some of my client code:

```csharp
// Classify
string text = "Claim your free Macbook now!";
double posteriorProbSpam = classifier.Classify(text, "spam");
Console.WriteLine("-------------------------");
double posteriorProbHam = classifier.Classify(text, "ham");
```

Now say the word 'free' appears somewhere in my training data:

```csharp
// Training
classifier.Train("ham", "Attention: Collect your Macbook from store.");
// ... lots more training calls here ...
classifier.Train("spam", "Free macbook offer expiring.");
```

But the word 'free' is present in my training data for the category 'spam' only, not for 'ham'. So when I calculate posteriorProbHam, what do I do when I come across the word 'free'?

*(Image in the original post: the formula I am following.)*


1 Answer


Still add one. The reason: Naive Bayes models P("free" | spam) and P("free" | ham) as completely independent, so you want to estimate each probability completely independently. The Laplace estimator you're using for P("free" | spam) is (count("free" | spam) + 1) / count(spam); P("free" | ham) is estimated the same way from the ham counts.

If you think about what it would mean to add one only when a zero count occurs, it wouldn't really make sense: seeing "free" one time in ham would then lower your estimate of "free" in spam (the +1 would vanish from the spam count), even though the two estimates are supposed to be independent.
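For concreteness, here's a minimal sketch of that per-class estimate (the names are illustrative, and the denominator includes the vocabulary size, as Ben Allison's comment below points out, so that each class's distribution sums to 1):

```csharp
using System;
using System.Collections.Generic;

// Laplace-smoothed estimate of P(word | class), using only that class's counts.
static double WordProbability(
    string word,
    Dictionary<string, int> classWordCounts, // word counts within this class only
    int classTotalWords,                     // total word occurrences in this class
    int vocabularySize)                      // distinct words across all classes
{
    classWordCounts.TryGetValue(word, out int count); // 0 if unseen in this class
    return (count + 1) / (double)(classTotalWords + vocabularySize);
}

var spamCounts = new Dictionary<string, int> { ["free"] = 1, ["offer"] = 1 };
var hamCounts  = new Dictionary<string, int> { ["macbook"] = 1, ["store"] = 1 };
int vocab = 4;

double pFreeSpam = WordProbability("free", spamCounts, 2, vocab); // (1+1)/(2+4) ≈ 0.333
double pFreeHam  = WordProbability("free", hamCounts, 2, vocab);  // (0+1)/(2+4) ≈ 0.167
Console.WriteLine($"{pFreeSpam:F3} vs {pFreeHam:F3}");
```

Each class gets its own call with its own counts, so evidence about "free" in ham never changes the spam estimate.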

– Danica
  • Thanks. I've just edited to include the formula I am following. So for example P(viagra|Spam), if the training data has 0 count for viagra in the category 'spam', I should just add 1? – Science_Fiction Aug 28 '12 at 08:52
  • If you want to use Laplacian smoothing, add one to *all* of the numerators and denominators, not just zero counts. So if you had 10 free|spam, 5 free|non-spam, 50 spam total, 100 non-spam total, you'd estimate `P(free|spam) = (10+1)/(50+1)`, `P(spam) = (50+1)/(150+1)`, `P(free) = (15+1)/(150+1)`. You could also use a number smaller than 1 (e.g. 0.1, typically called "alpha", as it corresponds to using a [Dirichlet-alpha](http://en.wikipedia.org/wiki/Dirichlet_distribution) distribution as your [prior](http://en.wikipedia.org/wiki/Prior_probability) on these probabilities). – Danica Aug 28 '12 at 13:43
  • Yeah, that's what I ended up doing. Things look good sometimes; other times I end up with probabilities greater than 1. Looking at the formula above, this is easily possible depending on the result of the denominator. – Science_Fiction Aug 28 '12 at 14:29
  • @Science_Fiction Do you mean `P(spam | word1, word2, ...) > 1`? I might be wrong, but I don't think that should happen... It is true that e.g. `\sum_w P(w | spam)` will be greater than 1, though. – Danica Aug 28 '12 at 16:39
  • Your issue is happening because your denominator is incorrect: if you add one to all the counts, the denominator is count(spam) + v, where v is the number of words in your vocabulary, not count(spam) + 1 as suggested here. With count(spam) + 1, the sum over all words comes out greater than one. It won't affect the decision, but it will mess up any probabilities you try to calculate. – Ben Allison Mar 29 '13 at 09:17
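To make Ben Allison's correction concrete, here's a small check with made-up counts showing why the denominator needs the vocabulary size v rather than just 1:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up word counts for the 'spam' class.
var spamCounts = new Dictionary<string, int>
{
    ["free"] = 10, ["macbook"] = 25, ["offer"] = 15,
};
int spamTotal = spamCounts.Values.Sum(); // 50 word occurrences
int v = spamCounts.Count;                // vocabulary size: 3

// Denominator count(spam) + 1: the "probabilities" sum past 1.
double sumPlusOne = spamCounts.Values.Sum(c => (c + 1.0) / (spamTotal + 1)); // 53/51 ≈ 1.039

// Denominator count(spam) + v: a proper distribution that sums to exactly 1.
double sumPlusV = spamCounts.Values.Sum(c => (c + 1.0) / (spamTotal + v)); // 53/53 = 1.000

Console.WriteLine($"{sumPlusOne:F3} vs {sumPlusV:F3}");
```

As the comment says, the wrong denominator won't change which class wins, but it breaks any probability you report.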
    your issue is happening because your denominator is incorrect: if you add one to all the counts, the denominator is count(spam) + v, where there are v words in your vocabulary. Not count(spam)+1 as suggested here: if you do this, the sum over all words is greater than one. Won't affect the decision, but will mess up any probabilities you try to calculate – Ben Allison Mar 29 '13 at 09:17