We are successfully using MATCH ... AGAINST queries to search our database, which is mostly in Czech, so we use utf8_czech_ci as the default collation. We have set the minimum query length to 1 and disabled all stop words.

However, consider searching for the word Schedule.

When you type:

  • s : Schedule found
  • sc : nothing found
  • sch : Schedule found

It looks like it treats ch as a single character, which is correct for the Czech language but certainly not what we want in a fulltext search.

Is there a way to avoid this behaviour?

Vojtěch

1 Answer

Yes, utf8_czech_ci treats ch as a single letter that sorts between h and i. Č and č compare as equal to each other, but sort after every plain c. The same applies to other letters with a caron.

This provides the collation quirks of various utf8 collations.

I would argue that your observations are correct for that collation. Is "schedule" a Czech word?

To avoid it, pick another utf8 COLLATION for the column and rebuild the FULLTEXT index. utf8_bin, utf8_general_ci and utf8_unicode_ci are likely candidates. You may need two columns (and two indexes) holding the same text with different collations. Then choose which column to search, depending on which language's rules you want the search to follow.
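A minimal sketch of both options. The table `articles`, its columns `id` and `body`, and the index names are hypothetical, since the question does not show the schema. The `'sc*'` prefix search in boolean mode is also an assumption about how the queries are built.

```sql
-- Option 1: change the collation of the existing column and rebuild the index.
-- (`articles`, `body` and `ft_body` are hypothetical names.)
ALTER TABLE articles DROP INDEX ft_body;
ALTER TABLE articles
    MODIFY body TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);

-- Option 2: keep the Czech column and add a second copy of the text
-- with a collation that does not treat "ch" as one letter.
ALTER TABLE articles
    ADD COLUMN body_general TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;
UPDATE articles SET body_general = body;
ALTER TABLE articles ADD FULLTEXT INDEX ft_body_general (body_general);

-- Search the copy when Czech letter rules are not wanted.
SELECT id
FROM articles
WHERE MATCH(body_general) AGAINST('sc*' IN BOOLEAN MODE);
```

With option 2 the two columns have to be kept in sync by the application (or by triggers). Option 1 is simpler, but then sorting and comparisons on that column stop following Czech rules as well.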

Are you "comparing" strings? If so, the collation will make a big difference: "say" < "see" < "sch" in Czech, but not in any(?) other collation.
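One way to check the ordering yourself is to compare the same literals under two collations. This is only a sketch: the `_utf8` introducer is there so the literals are utf8 regardless of the connection character set, and on newer MySQL versions these collations may be listed under `utf8mb3_` names.

```sql
-- Same comparison under two collations; each column returns 1 (true) or 0 (false).
SELECT
    _utf8'sch' < _utf8'see' COLLATE utf8_general_ci AS sch_before_see_general,
    _utf8'sch' < _utf8'see' COLLATE utf8_czech_ci   AS sch_before_see_czech;
```

If the two columns differ, the Czech collation is ordering "ch" as its own letter after "h", which is the same rule that affects the FULLTEXT matches.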

(utf8mb4 operates the same as utf8, at least with respect to this Question.)

Rick James