13

I'm using Postgresql 9.4.5. When I go to psql and run \l I get

Encoding is UTF8
Collate is en_US.UTF-8 
cCtype is en_US.UTF-8

I have products table with a name column that has the following names:

T-700A Grouped
T-700 AGrouped
T-700A Halved
T-700 Whole

When I execute the following SQL in pql

SELECT name FROM products WHERE name LIKE '%T-700%' ORDER By name ASC;

I get the following output

T-700A Grouped
T-700 AGrouped
T-700A Halved
T-700 Whole

That sorting doesn't look natural. I expected to get

T-700 AGrouped
T-700 Whole
T-700A Grouped
T-700A Halved

It doesn't seem like Postgres is handling spaces the way I expected. Can anyone explain what is happening and suggest a way to fix this?

CodeSmith
  • 2,133
  • 1
  • 20
  • 43
  • Might [this](http://stackoverflow.com/questions/7018628/postgresql-sorting-mixed-alphanumeric-data) be of use to you? – Tim Biegeleisen Dec 30 '15 at 22:23
  • Yah, that looks like a similar issue. They don't really solve the issue in a global way though. I'd like to be able to setup the database in a way that all text columns behave the same way by default. – CodeSmith Dec 30 '15 at 22:45
  • 1
    See [Postgres ordering of UTF-8 characters](http://stackoverflow.com/a/32653074/1995738) – klin Dec 31 '15 at 00:40

1 Answers1

27

On Unix/Linux SE, a friendly expert explained that what you see is the proper way to sort Unicode. Basically, the standard is trying to sort:

di Silva Fred                  di Silva Fred
di Silva John                  diSilva Fred
diSilva Fred                   disílva Fred
diSilva John         ->        di Silva John
disílva Fred                   diSilva John
disílva John                   disílva John

Now if spaces were as important as letters, the sort could not separate the various identical spellings of Fred and John. So what happens is that it first sorts without spaces. Then in a second pass, strings that are the same without whitespace are sorted. (This is a simplification, the real algorithm looks fairly complex, assigning whitespace, accents and non-printable characters various levels of precedence.)

You can bypass the Unicode collation by setting:

export LC_ALL=C

Or in Postgres by casting to byte array for sorting:

order by name::bytea

Or (from Kiln's answer) by specifying the C collation:

order by name collate "C"

Or by altering the default collation for the column:

alter table products alter column name type text collate "C";
Community
  • 1
  • 1
Andomar
  • 232,371
  • 49
  • 380
  • 404
  • I've asked the question on the [linux / unix stackexchange](http://unix.stackexchange.com/questions/252419/unexpected-sort-order-in-en-us-utf-8-locale)... perhaps a unix guru knows the answer :) – Andomar Dec 30 '15 at 23:10
  • 2
    This just saved me, what a couple of hours it has been – jdcookie Feb 23 '17 at 06:21
  • 1
    Given solutions work for English, but what if I still need to sort byl locale dependent rules? I need unicode sorting respecting words. "On vacation" must come before "One". – Tomáš Cerha Feb 14 '20 at 10:55
  • Should note that this may have negative performance effect on the index usage. https://dba.stackexchange.com/questions/260267/is-there-any-performance-issue-with-including-the-collate-as-part-of-the-query – Natan Nov 05 '21 at 10:28