14

I am new to multilingual data and, I confess, I have never worked with it before. I am currently working on a multilingual site, but I do not know which languages will be used.

Which collation/character set of MySQL should I use to achieve this?

Should I use some Unicode type of character set?

Of course, these languages are nothing exotic; they will be among the languages in common use.

Ashish Gupta
Imran Naqvi

3 Answers

22

You should use a Unicode collation. You can set it as the default for your system, or on each field of your tables. The Unicode collations are the following, and these are their differences:

utf8_general_ci is a very simple collation. It removes all accents, converts the text to upper case, and then compares using the code of the resulting "base letter".
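As a rough model of that description (not MySQL's actual weight tables), the comparison key could be sketched in Python like this. The function name `general_ci_key` is made up for the illustration, and stripping accents via Unicode decomposition is an assumption about how "removes all accents" can be approximated.

```python
import unicodedata


def general_ci_key(text: str) -> str:
    """Hypothetical approximation of a utf8_general_ci-style comparison key."""
    # Decompose each character into base letter + combining marks (NFD).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, i.e. the accents.
    base_letters = (ch for ch in decomposed if not unicodedata.combining(ch))
    out = []
    for ch in base_letters:
        # Upper-case one character at a time. If upper-casing would expand
        # the character (Python turns "ß" into "SS"), keep it unchanged,
        # because this simple scheme maps each character to one base letter.
        upper = ch.upper()
        out.append(upper if len(upper) == 1 else ch)
    return "".join(out)


print(general_ci_key("résumé") == general_ci_key("RESUME"))   # True
print(general_ci_key("Straße") == general_ci_key("STRASSE"))  # False
```

The second comparison is false because a one-letter-to-one-letter scheme has no way to treat "ß" as the two letters "ss".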

utf8_unicode_ci uses the default Unicode collation element table.

The main differences are:

  1. utf8_unicode_ci supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

     utf8_general_ci does not support expansions or ligatures; it sorts all these letters as single characters, and sometimes in the wrong order.

  2. utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci, in contrast, is fine only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
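To see what an expansion changes in practice, here is a small Python sketch. The `EXPANSIONS` table and the `expansion_key` function are hypothetical and cover only the letters mentioned above; plain code-point sorting stands in for "every letter is a single character" and is not identical to what utf8_general_ci does.

```python
# Hypothetical expansion table, limited to the letters discussed above.
EXPANSIONS = {"ß": "ss", "Œ": "OE", "œ": "oe"}


def expansion_key(word: str) -> str:
    """Sort key that expands ligatures before comparing, ignoring case."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word).lower()


words = ["Office", "Œuvre", "Ode", "Strato", "Straße", "Strasse"]

# Plain code-point order: each character is one unit, so "Œ" and "ß"
# end up after all ASCII letters.
by_code_point = sorted(words)

# Expansion-aware order: "Œuvre" sorts as "oeuvre", "Straße" as "strasse".
by_expansion = sorted(words, key=expansion_key)

print(by_code_point)
# ['Ode', 'Office', 'Strasse', 'Strato', 'Straße', 'Œuvre']
print(by_expansion)
# ['Ode', 'Œuvre', 'Office', 'Straße', 'Strasse', 'Strato']
```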

The disadvantage of utf8_unicode_ci is that it is slightly slower than utf8_general_ci.

So unless you know exactly which languages and characters you are going to use, I recommend utf8_unicode_ci, which has broader coverage.
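A minimal sketch of how that choice might be applied, written as a Python helper that only builds the MySQL statements (it does not connect to a database). The helper name, the table `articles`, the column `title`, and the `VARCHAR(255)` type are all assumptions made for the example.

```python
from typing import List


def collation_statements(table: str, column: str,
                         charset: str = "utf8",
                         collation: str = "utf8_unicode_ci") -> List[str]:
    """Build MySQL statements that set the collation on a table or a column."""
    return [
        # Convert an existing table, including its text columns.
        f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} "
        f"COLLATE {collation};",
        # Or change a single column only (the column type is assumed here).
        f"ALTER TABLE `{table}` MODIFY `{column}` VARCHAR(255) "
        f"CHARACTER SET {charset} COLLATE {collation};",
    ]


# Hypothetical table and column names.
for statement in collation_statements("articles", "title"):
    print(statement)
```

On MySQL 5.5.3 and later, passing `utf8mb4` and `utf8mb4_unicode_ci` instead also covers characters outside the Basic Multilingual Plane, which MySQL's three-byte `utf8` cannot store.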

Extracted from MySQL forums.

Daniel Node.js
mariana soffer
1

UTF-8 covers most languages, so it is your safest bet. However, there are exceptions, and you need to make sure that all the languages you want to cover work in UTF-8. In my experience, when MySQL stores a character set it doesn't understand, it cannot sort the text properly, but the data remains intact as long as I read it out in the same character encoding I wrote it in.
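The point about reading data back in the same encoding it was written in can be illustrated outside MySQL with a few lines of Python. This is only a sketch of the general principle; the sample text and the choice of Latin-1 as the mismatched encoding are assumptions.

```python
text = "café"

# Bytes as they would be written to storage.
stored = text.encode("utf-8")

# Reading with the same encoding returns the original text.
same_encoding = stored.decode("utf-8")

# Reading with a different encoding (Latin-1 here) silently garbles it.
wrong_encoding = stored.decode("latin-1")

print(same_encoding == text)  # True
print(wrong_encoding)         # cafÃ©
```

The stored bytes are identical in both cases; only the interpretation on the way out differs.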

UTF-8 is a character encoding, that is, a way of storing a number. Which character is represented by which number is defined by Unicode, an important distinction. Unicode covers a large number of languages, and UTF-8 can encode all of its code points (0 to 10FFFF, sort of). Java, by contrast, represents a character internally as a 16-bit value, so anything above FFFF does not fit in a single char and needs a surrogate pair (not that you care about Java :).
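The distinction between the Unicode number and its encoded form can be made concrete with a short Python sketch. The helper name `describe` and the sample characters are chosen only for illustration.

```python
def describe(ch: str) -> tuple:
    """Return (code point, UTF-8 byte count, UTF-16 code unit count)."""
    code_point = ord(ch)                            # the Unicode number
    utf8_len = len(ch.encode("utf-8"))              # bytes used by UTF-8
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit units needed
    return code_point, utf8_len, utf16_units


# "A", "é", "€", and an emoji outside the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    code_point, utf8_len, utf16_units = describe(ch)
    print(f"U+{code_point:04X}: {utf8_len} UTF-8 byte(s), "
          f"{utf16_units} UTF-16 unit(s)")
# U+0041: 1 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+00E9: 2 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+20AC: 3 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+1F600: 4 UTF-8 byte(s), 2 UTF-16 unit(s)
```

The last line shows the case the 16-bit remark is about: a code point above FFFF takes two 16-bit units.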

Peter Mortensen
Martin Algesten
  • How would I go about doing proper sorting in any target language? I'm trying to build a large international application and I really *need* proper sorting. I don't need to use PHP or MySQL, but that's what I'm currently using. – Stephane Mar 02 '13 at 00:39
0

You can insert text in any language into a MySQL table by changing the collation of the table field to 'utf8_general_ci'. It is case-insensitive.

JWC May