3

Let's try this simple example to represent the problem I'm facing.

Assume this table:

CREATE TABLE testing1
(
    id serial NOT NULL,
    word text,
    CONSTRAINT testing1_pkey PRIMARY KEY (id)
);

and that data:

insert into testing1 (word) values ('Heliod, God');
insert into testing1 (word) values ('Heliod''s Inter');
insert into testing1 (word) values ('Heliod''s Pilg');
insert into testing1 (word) values ('Heliod, Sun');

Then I want to run this query to get the results ordered by the word column:

SELECT
    id, word
FROM testing1
WHERE UPPER(word::text) LIKE UPPER('heliod%') 
ORDER BY word asc;

But look at the output, it's not ordered. I would expect the rows to be in that order, using their ids: 2, 3, 1, 4 (or, if I use the word's values: Heliod's Inter, Heliod's Pilg, Heliod, God, Heliod, Sun). This is what I get:

query 1

I thought that maybe something could confuse postgresql because of the WHERE criteria I used, but the below happens if I just order by on the rows:

query 2

Am I missing something here? I couldn't find anything in the docs about ordering values that contain quotes (I suspect that the quotes cause that behaviour because of their special meaning in postgresql, but I may be wrong).

I am using UTF-8 encoding for my database (not sure if it matters though) and this issue is happening on Postgresql version 12.7.

The output of show lc_ctype; is

"en_GB.UTF-8"

and the output of show lc_collate; is

"en_GB.UTF-8"

babis21
  • 1,515
  • 1
  • 16
  • 29
  • 3
    In `psql` do: `show lc_ctype; show lc_collate;` and add output to your question. Best guess it is the locale sorting rules at work. – Adrian Klaver Aug 08 '21 at 00:16
  • Thanks for your reply! I've updated the question to include that information. How could I see those sorting rules for that locale? Is this documented somewhere? – babis21 Aug 09 '21 at 00:03
  • 1
    That depends. The default Postgres locale collation depends on the OS collation provider, say glibc. You would need to find out your provider and then look up its rules. I have never done that so I'm not going to be of much help there. FYI, in Postgres 10+ there is the ability to use `ICU` library, see [here](https://www.postgresql.org/docs/current/collation.html) 23.2.2. Managing Collations, provided the Postgres instance was built with `ICU` configured. – Adrian Klaver Aug 09 '21 at 14:43

2 Answers2

4

That is the correct way to order the rows in en_US.UTF-8. It does 'weird' (to someone used to ASCII) things with punctuation and whitespace, skipping on a first pass and considering it only for otherwise tied values.

If you don't want those rules, maybe use the C collation instead.

jjanes
  • 37,812
  • 5
  • 27
  • 34
1

Indeed, I've tried @jjanes's suggestion to use the C collation and the output is the one I would expect:

SELECT
    id, word
FROM testing1
ORDER BY word collate "C" ;

enter image description here

How weird, I have been using postgresql for some years now and I never noticed that behaviour.

Relevant section from the docs:

23.2.2.1. Standard Collations

On all platforms, the collations named default, C, and POSIX are available. > Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.

babis21
  • 1,515
  • 1
  • 16
  • 29