0

Okay, not the clearest title ever; feel free to improve.

I have a table representing thousands of linguistic forms. Many of these make heavy use of diacritics, so all of aha, áha̱ and ā̧́ḫà̀ may appear. The table (and the database) uses UTF-8 as character set and utf8mb4_unicode_520_ci as the default collation scheme, since searching should be case- and diacritic-agnostic (so searching for aha should bring up all three). These forms have all been entered in manually by human beings, though, so there are inevitably duplicates.

I’m currently trying to get a list of exactly identical forms in order to get rid of duplicates (manually – each token would have to be checked before being removed), but in this case I need to search in a diacritic-aware manner – that is, given the three tokens listed above, I would expect a search to yield no results, since they are three different forms because of the diacritics.

I figured this should be a fairly easy task; just do:

SELECT token FROM table GROUP BY token HAVING COUNT(token) > 1 COLLATE utf8mb4_bin

But alas, that does not work. Instead, it gives me an error message that “COLLATION utf8mb4_bin is not valid for CHARACTER SET latin1”. I should note that I have absolutely nothing Latin-1 anywhere – no character sets, no collations, no server charsets, nothing. There are also no stored procedures or anything else where Latin-1 might creep in.

No, this is because of this bug, which is apparently fixed from 5.7 onwards; see the description at the bottom:

For constructs such as ORDER BY numeric_expr COLLATE collation_name, the character set of the expression was treated as latin1, which resulted in an error if the collation specified after COLLATE is incompatible with latin1. Now when a numeric expression is implicitly cast to a character expression in the presence of COLLATE, the character set used is the one associated with the named collation.

Unfortunately, I’m on 5.6, and I don’t have the option of upgrading (annoyingly). Converting the data to Latin-1 is also not an option, nor is changing the collation on the table.

Is there a way to run my query or an equivalent one yielding the result set I’m after, without getting the collation error?

Janus Bahs Jacquet
  • 859
  • 1
  • 11
  • 27
  • 1
    You could use [HEX](https://dev.mysql.com/doc/refman/5.6/en/hexadecimal-literals.html), so `select token from table group by hex(token) having count(*)>1` – Luuk Oct 06 '20 at 18:33
  • @Luuk Clever! As a workaround, I’d simply duplicated the table and changed the collation scheme on the duplicate, but that doesn’t work if you want it to be more than a one-off, and preferably this should be user-initiable, so that an admin can remove duplicates through a web interface. – Janus Bahs Jacquet Oct 06 '20 at 18:36
  • I don't see a numeric_expr; is that SELECT missing something? – Rick James Oct 06 '20 at 22:23
  • Many characters would probably be unavailable in latin1. Sticking a COLLATE clause on is a possibility. Luuk's suggestion is _probably_ equivalent to `COLLATE utf8mb4_bin`. – Rick James Oct 06 '20 at 22:24
  • `COUNT(*)` should be the same as `COUNT(token)`. Can you provide a self-contained test case that demonstrates the problem? – Rick James Oct 06 '20 at 22:26
  • @RickJames The numeric expression is `COUNT(*) > 1` (doesn’t matter whether it’s counting `*` or `token`). I’m not sure why, but the result of `COUNT()` (which is an integer) for some reason gets converted to a string somewhere along the line, and up until 5.6, that conversion always happened with charset Latin-1, regardless of context charset or collation. At its base, according to the bug log, a simple `SELECT 2 > 1 COLLATE utf8_bin` will trigger the bug. – Janus Bahs Jacquet Oct 06 '20 at 22:27
  • See also https://stackoverflow.com/questions/28677662/why-am-i-seeing-collation-xxx-is-not-valid-for-character-set-yyy/28932003?noredirect=1#comment113587223_28932003 – Rick James Oct 06 '20 at 22:54

1 Answers1

0
SET NAMES utf8mb4;
CREATE TABLE x(s VARCHAR(11) COLLATE utf8mb4_unicode_520_ci NOT NULL);
INSERT INTO x (s)
    VALUES ('aha'), ('áha̱'), ('ā̧́ḫà̀'),
    ('i'), ('i̯');
SELECT s FROM x GROUP BY s HAVING COUNT(*) > 1;

Comes back with

aha i

without any complaints about numeric stuff.

I ran it on 5.6.46, 5.7.26, 8.0.16 and several MariaDB versions.

What I am doing differently than your case?

When adding an explicate COLLATE clause, put it on the component of the query that needs it. (COLLATE does not apply to the query as a whole; different parts can be collated differently.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • You’re not explicitly collating in the `SELECT` statement – for whatever reason, that seems to be what makes the difference. My tables have `utf8mb4_unicode_520_ci` as their collation because that’s what they normally need; needing `utf8mb4_bin` is more of a one-off (or at least rarer) need, so I can’t simply set the table to have the right collation before the `SELECT` statement. – Janus Bahs Jacquet Oct 06 '20 at 23:00
  • Oh bugger. I just realised that I did leave something out of the query in the question – I forgot the explicit collation! Added in now. – Janus Bahs Jacquet Oct 06 '20 at 23:02
  • @JanusBahsJacquet - Add to `COLLATE` clause to the part you want collated -- not the `HAVING COUNT(token) > 1`. – Rick James Oct 08 '20 at 22:24
  • Oh, I thought `COLLATE` applied to the statement as a whole. Didn’t realise you could apply it to individual parts! – Janus Bahs Jacquet Oct 08 '20 at 22:26
  • @JanusBahsJacquet - Ok, I added to my Answer. – Rick James Oct 08 '20 at 22:33