
I have been working on getting SpamAssassin up and running for a while now and am pretty close to being finished. However, there is one last thing nagging at me that I can't seem to figure out. I have searched around a bit but have not found a conclusive answer, so I just want a little clarity so I can sleep better at night.

I have read that SpamAssassin needs at least 200 messages, preferably 1000, to do an effective job of Bayesian filtering. I have been feeding it spam (at least I think so) by issuing the following command:

sa-learn --showdots --mbox --spam spamfolder

As far as I can tell it is being processed by SpamAssassin. So I run:

sa-learn --dump magic

and get the following output:

bruticus@bruticus:~$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        306          0  non-token data: nspam
0.000          0        210          0  non-token data: nham
0.000          0      68430          0  non-token data: ntokens
0.000          0 1318421928          0  non-token data: oldest atime
0.000          0 1319141693          0  non-token data: newest atime
0.000          0 1319142287          0  non-token data: last journal sync atime
0.000          0 1319142287          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

Are the values in the nspam and nham rows indicative of the actual number of messages that SpamAssassin has learned and is using for its Bayesian analysis?

Do I need to get these two numbers up into the thousands before SpamAssassin really starts doing its job? How do I know when I have fed it enough spam for it to start working correctly?

jmreicha

1 Answer

You always need both spam and ham samples. If you only feed it spam, SpamAssassin refuses to activate the Bayesian filter.
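
A minimal sketch of training on both kinds of mail, assuming the ham sits in an mbox file just like the spam does. The function name `train_bayes`, the `SA_LEARN` override variable, and the `hamfolder` path are made up for illustration; the `sa-learn` flags are the ones from the question, with `--ham` for the second run.

```shell
# Hypothetical wrapper: learn one spam mbox and one ham mbox in a single call.
# SA_LEARN can be overridden (for example for a dry run); it defaults to sa-learn.
train_bayes() {
    spam_mbox="$1"
    ham_mbox="$2"
    "${SA_LEARN:-sa-learn}" --showdots --mbox --spam "$spam_mbox" &&
    "${SA_LEARN:-sa-learn}" --showdots --mbox --ham "$ham_mbox"
}

# Example (paths are placeholders):
#   train_bayes spamfolder hamfolder
```

The ham run only starts if the spam run succeeded, so a bad path shows up right away instead of leaving the database lopsided.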

By running `spamassassin -D < /path/to/a/complete.mail` you can check whether Bayesian filtering is active or not (look for the bayes lines somewhere in the debug output).
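
Since the debug output is long, it can help to filter it. The sketch below assumes the relevant debug lines contain the word "bayes"; the exact wording varies between SpamAssassin versions, and the helper name `bayes_debug_lines` is made up for illustration.

```shell
# Hypothetical filter: keep only the lines on stdin that mention bayes.
bayes_debug_lines() {
    grep -i 'bayes'
}

# spamassassin writes its debug output to stderr, so send stderr into the pipe
# and discard the rewritten message that goes to stdout:
#   spamassassin -D < /path/to/a/complete.mail 2>&1 >/dev/null | bayes_debug_lines
```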

Hopefully you didn't train SpamAssassin with old spam (months old). It only works well if you train it on recent spam that you (personally or as a company) actually received. If you don't have ham or spam samples right now, you are better off setting SA to auto-learn. The filter then gets trained over time. This takes longer and you won't see the benefit right away, but the outcome will impress you in the end.


Yes, your numbers show the messages that are currently learned. Once both nspam and nham are at least 200 (the default minimum), the Bayesian filter is active and you are finished. Everything above that just makes it "safer", as in "more valid" or "accurate". With auto-learning these numbers will increase over time, and they can also decrease as statistics from old mail are dropped.
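
That check can be scripted. This is a rough sketch, assuming the `sa-learn --dump magic` layout shown in the question (count in the third column, row name in the last); the function name `bayes_ready` and its optional threshold argument are made up for illustration.

```shell
# Hypothetical helper: reads `sa-learn --dump magic` output on stdin and
# succeeds only if both nspam and nham reach the minimum (default 200).
bayes_ready() {
    min="${1:-200}"
    awk -v min="$min" '
        $NF == "nspam" { nspam = $3 }   # third column holds the count
        $NF == "nham"  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d min=%d\n", nspam, nham, min
            # exit status 0 when both counters are at or above the minimum
            exit !((nspam + 0) >= min && (nham + 0) >= min)
        }'
}

# Example:
#   sa-learn --dump magic | bayes_ready && echo "enough mail learned"
```

With the numbers from the question (306 spam, 210 ham) this prints `nspam=306 nham=210 min=200` and succeeds; calling `bayes_ready 1000` instead would fail until both counters reach 1000.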

mailq
  • Yes I have fed the filter recent ham and spam samples. Are the nham and nspam indicative of how much of each type of mail that SA is aware of? I am not having any issues feeding it, I just want to know of a way to check the training process. – jmreicha Oct 21 '11 at 01:57
  • @jmreicha You can find out the current numbers with `sa-learn --backup`. The first 3 rows show you the number of spam (num_spam) and ham (num_nospam) messages. Edit: Oh, I see. They match your output. – mailq Oct 21 '11 at 08:35
  • I think we're on the same page. The only thing I am interested in knowing is if nspam and nham rows of my output posted above are accurate as far as SA is concerned. That way I will know how much spam/ham to keep feeding. – jmreicha Oct 21 '11 at 13:03
  • Yes. Both commands show the same numbers. You are correct and I am correct. – mailq Oct 21 '11 at 13:04