
My SpamAssassin setup is very old and has been migrated across two machines and multiple MySQL reinstalls. I recently noticed that when running sa-learn on messages, I'd get a slew of errors like the following in my logs:

bayes: _put_token: SQL error: Incorrect string value: '\x8A\x98s\x9A\xC8' for column 'token' at row 1

I examined the database and did some research, and found this very old bug report that identified the issue: my schema is (very!) out of date. The column bayes_token.token is CHAR(5) with collation utf8_general_ci, whereas as of, oh, 9 years ago it should be BINARY(5).

I can't simply convert the column type, because many of the 5-char UTF-8 values are longer than 5 bytes (the table has ~110K rows).
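For what it's worth, since LENGTH() in MySQL counts bytes rather than characters, the over-long rows can be counted with something like:

SELECT COUNT(*) FROM bayes_token WHERE LENGTH(token) > 5;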

My question is: is there any method of truncating the over-long tokens that will keep them valid in the bayes classifier? Or, failing that, can I drop only those rows so I can convert the remainder to binary?

UPDATE: I did manage to transfer the column contents into an added BINARY(5) column in the table (here named token2) as follows:

UPDATE bayes_token SET token2 = CONVERT(token USING latin1);

This is as close as I can figure to "walking back" what happened to the tokens when inserted. However, there are many duplicates in the resulting column, and as the token column is part of the primary key, that's no good.
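The duplicates show up with a simple grouping query, something like:

SELECT token2, COUNT(*) AS n FROM bayes_token GROUP BY token2 HAVING n > 1;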

I think the data can be conserved, but maybe not in pure SQL. I need to:

  1. Create a new copy of the empty table from the up-to-date schema
  2. Group all duplicates of each token, summing those rows' ham_count and spam_count values and taking the max atime value
  3. Insert these aggregated results into the new table
  4. Replace the original table with the new one
Headbank
  • Do you know what `\x8A\x98s\x9A\xC8` should have said? – Rick James Aug 06 '20 at 05:05
  • As far as I understand it, these are supposed to be binary strings. The bug I linked to above is about the CHAR column type being mistakenly used in the table schema (until the bug was fixed). I think the problem is SA has expected that column to be BINARY ever since the fix, but in my case it has remained CHAR. I'm kinda hoping there may be some SA veterans around who went through this issue themselves at the time. – Headbank Aug 06 '20 at 09:19

2 Answers


It sounds like the text could be in different encodings. You can't put different encodings into a single column unless you first convert to, say, utf8mb4. But if you don't know what the encoding is, that is not practical.

If your goal is simply to pass the bytes along, I would use VARBINARY(..) or BLOB so that you don't stumble over charset issues.

In MySQL, foo VARCHAR(5) CHARACTER SET utf8 can take up to 15 bytes. For utf8mb4, up to 20 bytes. So VARBINARY(20), with no CHARACTER SET.
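You can verify the byte counts yourself, since LENGTH() counts bytes and CHAR_LENGTH() counts characters; for example, five 4-byte emoji come to 20 bytes:

SELECT CHAR_LENGTH(_utf8mb4'😀😀😀😀😀') AS chars,
       LENGTH(_utf8mb4'😀😀😀😀😀') AS bytes;  -- 5 chars, 20 bytes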

If your table currently has foo in it, declared that way, do

ALTER TABLE t
    MODIFY COLUMN foo VARBINARY(20);

(Tack on NULL or NOT NULL as appropriate.)

Rick James
  • The problem is the data going in was always (I believe) starting out as 5-byte binary strings. As I (barely) understand it, because the db treated this as UTF8 text on insertion, many of the tokens ended up longer than that. I need to ultimately end up with a BINARY(5) column but I'm not sure whether there's any way of doing this without rendering the tokens meaningless to SA. – Headbank Aug 06 '20 at 10:38

I believe I've reconstituted the data as best I could, as follows.

  1. Add a column token2 BINARY(5) NOT NULL to table bayes_token
  2. UPDATE bayes_token SET token2 = CONVERT(token USING latin1);
  3. Create a new table bayes_token2 using the modern schema
  4. INSERT INTO bayes_token2 SELECT '1' AS id, token2 AS token, SUM(spam_count) AS spam_count, SUM(ham_count) AS ham_count, MAX(atime) AS atime FROM bayes_token GROUP BY token2;
  5. Drop table bayes_token and rename bayes_token2 in its place (the full sequence is sketched below)
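Put together, the whole sequence looked something like this (a sketch; the CREATE TABLE for bayes_token2 should come from the current SpamAssassin bayes_mysql.sql, and '1' is my bayes user id):

ALTER TABLE bayes_token ADD COLUMN token2 BINARY(5) NOT NULL;
UPDATE bayes_token SET token2 = CONVERT(token USING latin1);
-- create bayes_token2 here from the up-to-date schema, then:
INSERT INTO bayes_token2
    SELECT '1' AS id, token2 AS token,
           SUM(spam_count) AS spam_count, SUM(ham_count) AS ham_count,
           MAX(atime) AS atime
    FROM bayes_token
    GROUP BY token2;
DROP TABLE bayes_token;
RENAME TABLE bayes_token2 TO bayes_token;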

The vast majority of the tokens were unique and didn't exceed 5 bytes in UTF-8 anyway, but I think the way I've done this has preserved the remainder correctly too.

Headbank