Where are specific collations used in a modern utf application

Question

I was wondering if collation is still a big issue when working with databases without legacy data.

For example, for something like BigQuery or Snowflake the character encoding is utf-8. BigQuery actually only supports two collations -- default and case-insensitive 'und:ci'. Snowflake has some additional collations.

In my own usage, I have only found myself using case-sensitive or case-insensitive collation on a string/character column. Are there other uses of where collation may be used? I apologize if this is a naive question (perhaps this is related to my only knowing English and never having to deal much with sorting other languages).

Ok, but there are millions of aplication that need special collation we in Europe and Americas can live with ASCII all our livesy still we use utf8mb4 because it gives us more freedom — nbk, Jun 21 '23 at 06:22
@nbk -- I see. Is that more about the character set or collation though, or both? — David542, Jun 21 '23 at 06:26
@nbk: nooo! ASCII had a mostly careful choice of characters, but they need to choose from many more characters required for good English style. Unfortunately that limitation forced people to reduce the character set, and people start to lost information of that. We lost good "quote charactes". We lost the cent currency symbol. We do no have pound symbol (nor modern Euro symbol). We do not have the division symbol, nor the multiplication symbol. Few people can distinguish different dashes, hyphen and minus characters. — Giacomo Catenazzi, Jun 21 '23 at 07:29
@David542 it is for both as evrry characters set allows the collation, to interpret the data differently — nbk, Jun 21 '23 at 08:52
@GiacomoCatenazzi maybe you didn't understand my comment, where i explicit mention the lost of possibilities, which include emojis and more — nbk, Jun 21 '23 at 08:56

score 2 · Answer 1 · answered Jun 21 '23 at 07:51

It is difficult to answer, but if you ask, probably it doesn't matter for you.

Collation is about ordering stuffs alphabetically (non-numeric). Do it matter for you if a come before or after A, the order of AaA, aBA, ABa, etc? (111 is before AAA or after ZZZ?) (and about accented characters? Near base character, or within symbols?). On most application we do no care: at most we want a consistent ordering. Phone book had different ordering then most dictionaries. So there is no single collation for a single language). And between languages there are strange rules (ll in Spanish, å in Danish without forgetting å is also a unit symbol).

Also to make thing more complex: now an application may be multilingual, so a single collation for database is not enough. And probably not per table or per field. So now it is good to select the collation at query time (so with language of the user), but that break indices (you cannot make an index before knowing the ordering). Or we just use Unicode collation algorithm, which it is easier to understand (and without many historical exceptions). It work well for most languages.

So, it is up to you. You are doing an online dictionary and in several languages? So you need a language specific collation, and as people expect on a dictionary. Else: it doesn't matter so much. We now uses more searches then indices (but for search, we normalize strings for searches, so with less surprises with accents).

So, if you do not have some particular need, uses the default, or the Unicode default collation. If people complain, then you know about the need of better collation, and you should also have more information (about use cases). But I would not over-engineer for a case probably nobody uses or cares (and so with eventual slow down on indexing).

Freeman · Answer 2 · 2023-06-30T20:05:21.720

Alright, my friend, I'll do my best to explain how it's used with some examples that I know!

collation in SQL gives us the power to customize sorting, searching, and comparison in our own way. We can control case sensitivity, accent sensitivity, use linguistic rules, handle Unicode normalization, and even optimize performance. It's all about making sure our data is handled just right for our specific needs and languages!

As I told you before one of them is case-sensitivity,collation can actually decide if sorting and searching are gonna care about upper or lower case.it totally affects how we handle those big and small letters, something like this:

SELECT name FROM users ORDER BY name COLLATE Latin1_General_CS_AS;

as you see it will sort the names in a way that pays attention to upper and lower case, so it treats those big and small letters as totally different!

SELECT name FROM users WHERE name = 'Test' COLLATE Latin1_General_CI_AS;

and this will search for the name 'Test' in a case-insensitive manner, treating uppercase and lowercase letters as equal.

Next use of Collation is accent sensitivity,collation can determine whether accents or diacritical marks are considered significant or ignored during sorting and searching,this is particularly useful for languages that use accents. for example:

SELECT name FROM users ORDER BY name COLLATE Latin1_General_CI_AI;

this will sort the names in a case-insensitive manner, treating accents as significant!

SELECT name FROM users WHERE name = 'Cafe' COLLATE Latin1_General_CI_AI;

this will search for the name Café in a case-insensitive manner, treating the accent as significant!

next is linguistic rules,collation can actually include these specific rules for different languages or regions,it's all about how we handle those special characters and sort things out based on the language, something like this:

SELECT name FROM users ORDER BY name COLLATE French_CI_AS;

this will sort the names using French linguistic rules, such as treating é as distinct from e!

SELECT name FROM users WHERE name = 'café' COLLATE French_CI_AI;

this will search for the name café in a case-insensitive manner, treating the accent as significant, according to French linguistic rules!

moving on to unicode normalization,collation can deal with different forms of normalization, like NFC or NFD,this makes sure that we compare and sort characters consistently, even if they have multiple representations,check this out:

SELECT name FROM users ORDER BY name COLLATE utf8_unicode_ci;

this will sort the names using a case-insensitive collation with Unicode normalization!

SELECT name FROM users WHERE name = 'café' COLLATE utf8_unicode_ci;

this will search for the name café in a case-insensitive manner, considering the Unicode normalization of the characters!

next is,performance optimization,so, when it comes to choosing the right collation, it can really make a difference in how fast we can sort and search stuff.different collations have their own performance traits, you know? Like, using a binary collation (_bin) can give us quicker sorting and searching compared to those linguistic collations.but, here's the catch,we might have to give up some language-specific behavior.check this out:

SELECT name FROM users ORDER BY name COLLATE utf8_bin;

this will sort the names in a case-sensitive manner, based on the binary representation of the characters!

SELECT name FROM users WHERE name = 'Test' COLLATE utf8_bin;

this will search for the name 'Test' in a case-sensitive manner, based on the binary representation of the characters!

there could be some other ways to use it too, but these are the ones I've looked into and mentioned for you, buddy. Just wanted to give you a heads up!

sure but this is my questions itself. Outside of using a case-insensitive or case-sensitive collation -- what are the uses of collation? — David542, Jun 30 '23 at 18:25
@David542 After doing a ton of searching and reading articles on Google and other places, I completely rewrote my entire answer for you! — Freeman, Jun 30 '23 at 20:12

score 1 · Answer 3 · answered Jun 30 '23 at 11:43

According to Character Sets and Collations in General

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

In real life, most character sets have many characters: not just A and B but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German Ö), and for multiple-character mappings (such as the rule that Ö = OE in one of the two German collations).

In the past, I've used sql to retrieve rows that order by pronunciation of one column.

I was using Chinese, but for simplicity, I will try to use English letter to explain how to order by pronunciation.

id	name	Pronouncing
1	A	ei
2	B	bi
3	C	si
4	D	di
5	E	i:

select * from user order by Pronouncing;

/* expected B, D, A, E, C */

If we set a pronunciation collation(I don`t know if it exists) to column name, then we can directly get the correct result which is

B, D, A, E, C

thanks! This is a good example. – David542 Jun 30 '23 at 18:26 — David542, Jun 30 '23 at 18:26

Where are specific collations used in a modern utf application

3 Answers3