My spamassassin setup is very old and has been migrated across two machines and multiple mysql reinstalls. I recently noticed that when running sa-learn
on messages, I'd get a slew of errors like the following in my logs:
bayes: _put_token: SQL error: Incorrect string value: '\x8A\x98s\x9A\xC8' for column 'token' at row 1
I examined the database and did some research and found this very old bugreport that identified the issue: my schema is (very!) out-of-date. The column bayes_token.token
is CHAR(5)
collation utf8_general_ci, where as of oh, 9 years ago, it should be BINARY(5)
.
I can't simply convert the column-type, because many of the UTF8 5-char values are longer than 5 bytes (the table has ~110K rows).
My question is: is there any method of truncating the over-long tokens that will keep them valid in the bayes classifier? Or failing that can I drop only those rows so I can convert the remainder to binary?
UPDATE: I did manage to transfer the column contents into an added BINARY(5)
column in the table (here named token2
) as follows:
UPDATE bayes_token SET token2 = CONVERT(token USING latin1);
This is as close as I can figure to "walking back" what happened to the tokens when inserted. However, there are many duplicates in the resulting column, and as the token
column is part of the primary key, that's no good.
I think the data can be conserved, but maybe not in pure SQL. I need to:
- Create a new copy of the empty table from the up-to-date schema
- group all duplicates of each token and sum those rows'
ham_count
andspam_count
values and maxatime
value - insert these aggregated results into the new table
- replace the original table with the new one