4

I am trying the PostgreSQL database for the first time, after having worked for some time with MySQL. My environment is a shared hosting with cPanel and phpPgAdmin. One thing that puzzles me is the database collation. My hosting's cPanel always creates databases with the Encoding, Collation, and Character Type set to UTF8, en_US.UTF-8, and en_US.UTF-8 respectively. I don't seem to have any way to change it, because the database is created via cPanel, there are no options there, and according to this answer, these parameters can only be changed by re-creating the database with the desired settings.

So I was wondering: does it actually matter? If the collation is set to en_us.UTF8, what will happen with non-English and even non-Latin strings, such as Russian or Hebrew? How will they be sorted?

Update: I am puzzled because in MySQL I used to just choosing the utf8mb4_unicode_ci collation and not care about a specific language. I wonder how it works in comparison to country-language-specific collations in PostgreSQL.

texnic
  • 3,959
  • 4
  • 42
  • 75
  • 1
    @zerkms: collation (sort order) is language dependent. For example, in my native tongue, there is a letter between _n_ and _o_. – ninjalj Oct 03 '15 at 22:46
  • @ninjalj oh gosh, I've read the "How will they be sorted?" as "How will they be **stored**?", hence made an accent on the charsets. – zerkms Oct 03 '15 at 22:58

1 Answers1

5

The collation behaviour will depend on the underlying operating system's locale support. In general I think you should expect fairly "dumb" sorting, probably by Unicode codepoint, for characters outside the current locale. Probably the default collatoin for the unicode collation algorithm.

I'm having a hard time quickly finding sample data that collates differently in order to test that.

Note, though, that you can declare collations on a per-column or per-operation basis with the COLLATE term, e.g.

CREATE TABLE sometable(
   ...,
   companyname text COLLATE "ru_RU",
   ...
);

or

SELECT *
FROM sometable
ORDER BY companyname COLLATE "ru_RU"

or

SELECT *
FROM sometable
WHERE companyname < 'Компания' COLLATE "ru_RU"

so you aren't limited to the database's collation.

(Doing this is really the only way to properly handle data in assorted/mixed languages, since there's no one correct collation for all the data anyway).

Edit showing an example:

test=> SHOW lc_collate;
 lc_collate 
------------
 en_AU.utf8
(1 row)

test=> SELECT * FROM (VALUES ('z'),('aa')) x(y) ORDER BY y;
 y  
----
 a
 aa
 z
(3 rows)

craig=> SELECT * FROM (VALUES ('z'),('aa')) x(y) ORDER BY y COLLATE "da_DK";
 y  
----
 a
 z
 aa
(3 rows)

This also shows that the concept of a language-independent collation is total nonsense, and what people really usually mean is "English-like collation" or "collate by Unicode ordinal" (which is in mostly English-like order).

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • 1
    From what I can tell, in _da_DK_, _nb_NO_, and _nn_NO_, _aa_ is treated as an additional letter after _z_. – ninjalj Oct 04 '15 at 10:21
  • Thanks for the answer, that already helps. I wonder if you could also comment on the differences from MySQL's utf8mb4_unicode_ci collation. It seems to be language-independent, doesn't it? – texnic Oct 04 '15 at 11:25
  • 1
    @texnic At a guess that's probably an implementation of the generic unicode collation. See the linked article above. It must be modified to ignore case, which means it isn't fully language independent as different languages have different concepts of case. It's probably case insensitive for English only but I really don't know. MySQL docs might have more. I don't know MySQL's collation features well. I usually find that when people say 'language-independent' they really mean 'follows English rules' though... – Craig Ringer Oct 04 '15 at 11:35
  • 1
    @texnic: nop, that's a generic collation. For some languages, it has to be tailored, e.g: traditional spanish (a long time ago), had _ch_ and _ll_ letters (which sorted as separate letters after _c_ and _l_, respectively). See http://userguide.icu-project.org/collation for some language-specific examples. – ninjalj Oct 04 '15 at 11:38
  • Somehow trying the example above fails: doing `SELECT * FROM names ORDER BY name_text COLLATE "ru_RU";` I receive: `SQL error: ERROR: syntax error at or near "COLLATE"` – texnic Oct 04 '15 at 11:42
  • @texnic You're probably using an old version of PostgreSQL – Craig Ringer Oct 04 '15 at 11:45
  • Indeed, it's version 8.4! Need to check with my hosting provider. Thanks! – texnic Oct 04 '15 at 11:48
  • @texnic Wow. cPanel phpPgAdmin and 8.4. Run, don't walk, to a different hosting provider. – Craig Ringer Oct 04 '15 at 11:53
  • 1
    @texnic They're very limited. I guess it depends on your needs. 8.4 for a new signon is unacceptable though. – Craig Ringer Oct 04 '15 at 12:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/91293/discussion-between-texnic-and-craig-ringer). – texnic Oct 04 '15 at 12:13