Spamassassin working but not learning? Bayes-filter accuracy is not improving

Question

I have a mailserver with a working spamassassin installation (postfix, dovecot, amavis, clamav, spamassassin on debian).

Everything is working great. Spamassassin is filtering out spam and I get the headers:

X-Virus-Scanned: Debian amavisd-new at xxx.yyy.de
X-Spam-Flag: YES
X-Spam-Score: 8.025
X-Spam-Level: ********
X-Spam-Status: Yes, score=8.025 tagged_above=-9999 required=3
    tests=[BAYES_50=0.8, DKIM_INVALID=0.1, DKIM_SIGNED=0.1,
    HTML_IMAGE_ONLY_24=1.618, HTML_MESSAGE=0.001,
    RAZOR2_CF_RANGE_51_100=1.886, RAZOR2_CHECK=0.922,
    RCVD_IN_BL_SPAMCOP_NET=1.347, SPF_HELO_NONE=0.001, SPF_PASS=-0.001,
    URIBL_ABUSE_SURBL=1.25, URIBL_BLOCKED=0.001]
    autolearn=no autolearn_force=no

I am training Spamassassin (currently manually) with new spam and ham when it comes in:

Tue Dec 15 22:22:14 2020
Spam training for xxx@yyy.de
Learned tokens from 12 message(s) (159 message(s) examined)
Ham training for xxx@yyy.de
Learned tokens from 4 message(s) (49 message(s) examined)
Deleting spam for xxx@yyy.de older than 30 days
Syncing the SpamAssassin journal
bayes: synced databases from journal in 0 seconds: 2711 unique entries (2711 total entries)
Statistics for this run:
0.000          0          3          0  non-token data: bayes db version
0.000          0       5288          0  non-token data: nspam
0.000          0        855          0  non-token data: nham
0.000          0     124148          0  non-token data: ntokens
0.000          0 1602145027          0  non-token data: oldest atime
0.000          0 1608066788          0  non-token data: newest atime
0.000          0 1608067345          0  non-token data: last journal sync atime
0.000          0 1607672985          0  non-token data: last expiry atime
0.000          0    5529600          0  non-token data: last expire atime delta
0.000          0      50552          0  non-token data: last expire reduction count
Run finished Tue Dec 15 22:22:27 2020

Everything seems to work. However, I recently noticed that one type of spam, which always looks the same, still reaches the inboxes even after a few weeks of training. Its Bayesian score doesn't change.

X-Virus-Scanned: Debian amavisd-new at xxx.yyy.de
X-Spam-Flag: NO
X-Spam-Score: 1.852
X-Spam-Level: *
X-Spam-Status: No, score=1.852 tagged_above=-9999 required=3
    tests=[BAYES_00=-1.9, DIGEST_MULTIPLE=0.293, DKIMWL_WL_MED=-0.001,
    DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HEADER_FROM_DIFFERENT_DOMAINS=0.249,
    HTML_MESSAGE=0.001, MAILING_LIST_MULTI=-1, PYZOR_CHECK=1.392,
    RAZOR2_CF_RANGE_51_100=1.886, RAZOR2_CHECK=0.922, SPF_HELO_NONE=0.001,
    SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01]
    autolearn=no autolearn_force=no

I can't seem to find any issues, and everything I've checked so far seems to be working. The default minimum of 200 spam and 200 ham messages has clearly been passed, so that shouldn't be an issue. I am training Spamassassin with these commands:

/usr/bin/sa-learn --no-sync --spam /var/vmail/$domain/$user/Maildir/.Junk/{cur,new} >> /var/log/sa-learn.log 2>&1
/usr/bin/sa-learn --no-sync --ham /var/vmail/$domain/$user/Maildir/{cur} >> /var/log/sa-learn.log 2>&1
/usr/bin/sa-learn --sync >> /var/log/sa-learn.log 2>&1

What could be the problem? I'm not sure where to look any more.

Any help is greatly appreciated.

One common problem is saving training data into a different database (e.g. by calling as a different user) than the one used in the actual mail server. — anx, Dec 17 '20 at 00:55
That is something I am trying to look into. Where can I find the user that's used when checking and grading emails? — Esprit1st, Dec 24 '20 at 00:55
Well, I don't think training data is going into a mysql database. There is no config in Amavis that points in that direction, and no mysql database that has anything spam-related in it. So it must be file-based. — Esprit1st, Dec 24 '20 at 01:04
Depending on the distribution, the bayes database files (no SQL used afaik, but still a database) are stored in a path like `/var/lib/spamassassin/.spamassassin/bayes_toks` and related files. If you have multiple independent instances of that storage, e.g. one saved for the user you called sa-learn as, resulting in `/home/user/.spamassassin/`, you know your training has no effect on what amavis uses. So just `find` files named that way? — anx, Dec 24 '20 at 05:15
OK, I looked into the paths that you mentioned and those don't exist. However, as you suggested, I searched for .spamassassin and found two other locations: /var/lib/amavis/.spamassassin/ and /root/.spamassassin/. The root one might be the one that I am training manually, since I am doing that as root. However, just by looking at the file dates, I found that the journal file in the amavis folder got updated when I trained as root. Now I am trying to figure out how to train as amavis, and maybe move the training data from root to amavis. Both folders have bayes_seen, bayes_toks and user_prefs. — Esprit1st, Dec 27 '20 at 23:12
Also all files in both folders have recent change dates of just a couple days. So even though I trained as root even the amavis files have some recent data in there. I am a little confused about that. — Esprit1st, Dec 27 '20 at 23:14

anx · Answer 1 · 2020-12-28T02:43:17.663

The results of spamassassin Bayes training are stored in a database made up of a few files, commonly kept in the home directory of the user it is running under. If you call sa-learn as a different user, you are not accessing or updating the same dataset.

(Extended version of my earlier comment.)

For privilege separation, spamassassin usually runs under a separate user, such as debian-spamd or amavis, so during autolearning, the database of that user will be updated. If you wish to make manual updates to the database, you might need to specify the correct user, otherwise you would just be saving your training data to a different, unrelated database.

How to tell? If you have (backups aside) two instances of the training data files, you have been calling spamassassin under two different users (likely one from your mail server, one from your shell):

# find / -name bayes_toks
/var/lib/amavis/.spamassassin/bayes_toks
/root/.spamassassin/bayes_toks

Both files may have a recent modification timestamp. As soon as the database is sufficiently seeded, spamassassin may pick mail it has identified with high confidence and autolearn from it, that is, learn tokens from received mail without manual action (this behaviour can be configured, and you usually want it on).
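Since timestamps alone do not tell the two databases apart, comparing their message counters may help. The following is a minimal sketch, not a definitive procedure: it assumes the `sa-learn --dump magic` output format shown in the question's log, and the helper name `bayes_counts` is made up for illustration.

```shell
# Hypothetical helper: reduce `sa-learn --dump magic` output (read from
# stdin) to the number of spam and ham messages learned so far.
bayes_counts() {
    awk '$NF == "nspam" { spam = $3 }
         $NF == "nham"  { ham  = $3 }
         END { print "nspam=" spam, "nham=" ham }'
}

# Usage, assuming the two users found above (run from a root shell):
#   sa-learn --dump magic | bayes_counts
#   sudo -H -u amavis sa-learn --dump magic | bayes_counts
```

If manual training only ever raises the counters of the root database, the training is not reaching the database that the mail server reads.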

How to fix? Feed the same mails to the right database by calling sa-learn with the user/homedir that it is using when called from the mail server (verify this, the folder name might not match the username!):

sudo -H -u amavis sa-learn --no-sync --spam /var/vmail/$domain/$user/Maildir/.Junk/{cur,new} >> /var/log/sa-learn.log 2>&1
sudo -H -u amavis sa-learn --no-sync --ham /var/vmail/$domain/$user/Maildir/cur >> /var/log/sa-learn.log 2>&1
sudo -H -u amavis sa-learn --sync >> /var/log/sa-learn.log 2>&1
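To judge whether retraining has any effect, one option is to compare the `BAYES_nn` rule in the `X-Spam-Status` header of similar messages delivered before and after training. This is a rough sketch under the assumption that the headers look like the ones quoted in the question; the helper name `bayes_rule` and the message path are placeholders.

```shell
# Hypothetical helper: print the BAYES_nn rule name(s) found in the
# headers of a message read from stdin.
bayes_rule() {
    # Stop at the first empty line so the message body is not scanned.
    sed '/^$/q' | grep -o 'BAYES_[0-9][0-9]*' | sort -u
}

# Usage (the message path is a placeholder):
#   bayes_rule < /path/to/message
```

A spam message that was tagged `BAYES_00` before training and is still tagged `BAYES_00` on newly delivered copies suggests that the scanner is not reading the database that was trained.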

I do not recommend merging the unintentionally split datasets, because the internal file format can be a bit confusing (though it can be dumped using --backup and destructively overwritten using --restore). Retraining on the same spam data is much simpler, and sa-learn is designed to cope with being fed the same mail over and over without adverse effect.

So I tried different users and the only one that didn't throw an error message was vmail. I'll monitor the (hopefully positive) change over the next few days. — Esprit1st, Dec 28 '20 at 22:48
I don't know if it actually works. I don't really have a measure to check. Some of the emails seem to still get the same rating. But thanks for your help! — Esprit1st, Jan 02 '21 at 19:30
