I'm trying to understand Bayes-based spam detection and am having difficulty understanding how to code it. To learn, I'm reading SpamAssassin's code, such as the file below: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Bayes/CombineChi.pm?view=markup

However, I could not understand how the chi2q function works.

# Chi-squared function (API changed; see comment above)
sub chi2q {
  my ($x2, $halfv) = @_;

  my $m = $x2 / 2.0;
  my ($sum, $term);
  $sum = $term = exp(0 - $m);

  # replace 'for my $i (1 .. (($v/2)-1))' idiom, which creates a temp
  # array, with a plain C-style for loop
  my $i;
  for ($i = 1; $i < $halfv; $i++) {
    $term *= $m / $i;
    $sum += $term;
  }
  return $sum < 1.0 ? $sum : 1.0;
}

I tried googling and reading books, but could not find a full explanation that goes from theory to code.

Can you explain why it works?

sawa
Tsuneo Yoshioka

1 Answer

The chi-squared test can tell whether two sets of numbers are "similar".

The best explanation I could find with a quick Google search was here: http://formulas.tutorvista.com/math/chi-square-formula.html

This involves finding the difference between an observed value and the expected value, or the value under a different condition. The difference is then squared (and, in Pearson's formula, divided by the expected value). Squaring has two effects: every term becomes non-negative, and larger differences are accentuated.
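As a minimal sketch of the calculation described above, assuming Pearson's form of the statistic (which also divides each squared difference by the expected value), the sum might look like this in Python. The function name and the sample counts are made up for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Each term is non-negative, so deviations in either direction add up.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: three categories, 20 expected in each.
observed = [18, 22, 20]
expected = [20, 20, 20]
print(chi_square_statistic(observed, expected))
```

Identical lists give a statistic of 0, and the statistic grows as the observed counts drift away from the expected ones.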

All the values produced by this difference-and-square operation are then added up to give a single number, the test statistic. This statistic, together with the "degrees of freedom" of the observations, is looked up in a table to find the "p-value", the probability of a result at least this extreme occurring by chance.
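When the degrees of freedom are even, say 2k, the table lookup can be replaced by a finite sum. With m = x/2, the upper-tail probability is exp(-m) * (1 + m + m^2/2! + ... + m^(k-1)/(k-1)!). Each term is the previous one multiplied by m/i, which is the same recurrence as the loop in the question's chi2q, with $halfv playing the role of k. Below is a Python sketch of that series, assuming halfv is a positive integer; it is a port for illustration, not SpamAssassin's code.

```python
import math


def chi2q(x2, halfv):
    """Upper-tail chi-squared probability for 2 * halfv degrees of freedom.

    Assumes halfv is a positive integer (even degrees of freedom only).
    """
    m = x2 / 2.0
    term = math.exp(-m)       # i = 0 term of the series: exp(-m) * m^0 / 0!
    total = term
    for i in range(1, halfv):
        term *= m / i         # turns m^(i-1)/(i-1)! into m^i/i!
        total += term
    # Rounding can push the sum slightly above 1, so clamp it.
    return min(total, 1.0)


print(chi2q(9.488, 2))  # 4 degrees of freedom at the usual 5% critical value
```

For 4 degrees of freedom (halfv = 2), the textbook 5% critical value of 9.488 gives a result of roughly 0.05. Note that exp(-m) underflows to zero once m exceeds roughly 745 in double precision, which is one reason logarithms come up for this calculation.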

It allows two sets of values to be matched for similarity without them being exactly the same.

I'm sure you can imagine how useful this sort of comparison can be for detecting spam.
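One place such a p-value can be useful for spam is in combining many per-token probabilities into a single score. A common statistical tool for that is Fisher's method: if n probabilities are independent and uniform on (0, 1), then -2 times the sum of their logs follows a chi-squared distribution with 2n degrees of freedom, which is always even. The Python sketch below assumes that setup and works in log space so that very small results do not underflow. The function names are hypothetical, the inputs are assumed to lie in (0, 1], and nothing here is taken from SpamAssassin's source.

```python
import math


def log_chi2q(x2, halfv):
    """Log of the upper-tail chi-squared probability for 2 * halfv d.o.f.

    Uses the even-d.o.f. series exp(-m) * sum(m^i / i!, i = 0 .. halfv - 1)
    with m = x2 / 2, evaluated term by term in log space.
    """
    m = x2 / 2.0
    if m == 0.0:
        return 0.0  # the probability is exactly 1
    log_terms = [-m + i * math.log(m) - math.lgamma(i + 1) for i in range(halfv)]
    peak = max(log_terms)
    # log-sum-exp: factor out the largest term before exponentiating.
    log_sum = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return min(0.0, log_sum)


def fisher_combine_log(probs):
    """Log p-value for the hypothesis that probs are independent uniform draws."""
    x2 = -2.0 * sum(math.log(p) for p in probs)
    return log_chi2q(x2, len(probs))


print(fisher_combine_log([0.01, 0.02, 0.01]))  # strongly negative
print(fisher_combine_log([0.9, 0.95, 0.99]))   # close to 0
```

Probabilities close to 0 give a strongly negative log p-value, while probabilities close to 1 give a value near 0.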

Your code sample does not seem to do this, so I can only guess that the other calculations happen elsewhere in the SpamAssassin code base.

Vorsprung
  • How can one calculate or compare a p-value, or build a p-value table programmatically? I saw some information saying that logarithms are needed to avoid loss of significant digits, but I don't have a concrete idea how. Is there an actual sample implementation? – Tsuneo Yoshioka Nov 27 '13 at 11:32
  • This page http://easycalculation.com/statistics/p-value-t-test.php has a p value calculator. If you want to see how it works, the javascript code is there in the source code to the page – Vorsprung Nov 27 '13 at 11:46
  • Again, the JavaScript code is not a very direct implementation, nor is it easy for me to understand. There seem to be so many magic numbers... – Tsuneo Yoshioka Nov 27 '13 at 15:34
  • I've given you an overview explanation, links to pages that explain the process and some example code. I believe that the answer does address your question. If you believe the answer is not sufficient then you need to adjust your question. For example, the question has no expected input and expected output, you don't state *why* you need to know more about the method and there are no details on what exactly it is that puzzles you. Hope this helps – Vorsprung Nov 28 '13 at 10:34
  • Thanks. I just want to learn how to write a Bayesian filter for spam detection from scratch, without using a black box that I cannot understand. I have no clue how to gain that competence, so I started by reading SpamAssassin's code. That's why I'm asking. I hope that makes it clear... – Tsuneo Yoshioka Nov 28 '13 at 12:21
  • You probably want to look at http://search.cpan.org/search?query=bayesian&mode=all and in particular http://search.cpan.org/~gslin/Algorithm-Bayesian-0.5/lib/Algorithm/Bayesian.pm which looks like an easy to follow module, doesn't seem to use chi squared things! – Vorsprung Nov 28 '13 at 13:28