3

I am fighting with collations. I want to build a table with a unique index on word that is case insensitive but that differenciate "a" from "à". It means that "de" and "dé" should be 2 distinct entries but de, De, DE,dE should be the same.

Is there an utf8 collation that do the job (it does not seems to be the case) ? Is there a way to say to mysql not to use any collation ? Is there another solution ? I don't want to change my words to lower or upper case

thanks

Sunny
  • 11
  • 1
  • 4

1 Answers1

5
  1. Is there an utf8 collation that do the job (it does not seems to be the case) ?

    None of the collations that ship with MySQL will do this: the utf8_*_ci collations will not differentiate between accent variations and the utf8_bin collation will differentiate between lettercase.

  2. Is there a way to say to mysql not to use any collation ?

    A collation is always required for textual comparisons (it is meaningless to compare text without one); to avoid using a collation altogether you would have to cast your text to binary strings (e.g. using the BINARY operator), in which case strings are compared by their binary encoding.

    However, this will result in differences even where the exact same character has been encoded in different ways (it can happen!). You should therefore use instead the utf8_bin collation, in which characters are compared by their code-point (but, as mentioned above, this will differentiate on lettercase).

  3. Is there any solution ? I don't want to change my words to lower or upper case

    I would recommend using the UPPER() or LOWER() functions to change the case of your expressions (note the underlying data remains unchanged), then use the utf8_bin collation to perform comparisons. Thus, different lettercase will be equivalent but all other differences (including accents, but also contractions and expansions) will not. For example:

    SELECT col FROM my_table WHERE UPPER(col) LIKE UPPER('de') COLLATE utf8_bin
    

    See it on sqlfiddle.

    Alternatively, you can add your own collation, but (unless you recompile MySQL) you would have to define it relative to utf8_unicode_ci which will require resetting the weight of every accent variation - a lot of effort (although you might find you can obtain it from someone who has already done it, or that you can create the rules programatically).

eggyal
  • 122,705
  • 18
  • 212
  • 237
  • Thank you for your answer wich is quite clear. May be the UPPER is the best solution I need to check wich amount of code I will need to change. By going a little bit further I saw that the wright collation for my table is latin1_general_ci but I need to have other tables in utf8 and I can't manage to make work my app with one table in utf8 and one table in latin1. I am using rails and if I say that the connection to my database is in utf8 all the data coming from utf8 tables are ok but data coming from my latin1 table is wrong. Any idea ? – Sunny May 23 '12 at 10:06
  • @Sunny: MySQL should handle conversion from the table character set to your connection character set without problem; since that's not happening, it is likely that the data in your `latin1` table is not actually encoded in `latin1`. You will first need to determine in what encoding that data has been stored, then you can convert it. – eggyal May 23 '12 at 10:18
  • it's what I thought but it was not what I observed. I will try once again to fight with the encoding of this table to be sure it's store in real latin1. I'll let you know. Creating a new collation from unicode is also a good idea, not sure there is so many accent, I think it's around a hundred and you don't need to recompile mysql – Sunny May 23 '12 at 10:36
  • Ok you were right, the table was double badly encoded^(and not only one as I thought) so I can now mix the 2 collations. Thanks for your help – Sunny May 23 '12 at 10:45
  • Ok in fact it seems that I am almost f**** my table contains utf8 characters like chinese or arabic one so I have to use utf8, the upper(col) method break the search with the index so it is inneficient. Two solutions : building my own collation including only the main european accent, adding another column where everything is lower case and using an index on that col with a collation like utf8_bin – Sunny May 23 '12 at 14:38