It looks to me like you've taken data in the form of 7-bit ascii with 2-byte unicode hex escapes (\uxxxx
) and stored it in a varchar
typed field.
convert_from
is entirely the wrong way to handle that data. It isn't utf-8 except in that 7-bit ASCII is a subset of utf-8. If you forced it through convert_from(..., 'utf-8')
you'd get exactly what you started with, blah\u0123
etc.
The reason that select convert_from(E'\u0422\u0438\u043c\u0430\u0442\u0438', 'UTF8');
appears to work is that the convert_from
does nothing. Here's what happens:
PostgreSQL sees the literal value E'\u0422\u0438\u043c\u0430\u0442\u0438'
and sees that convert_from
wants bytea
input.
PostgreSQL parses the escape-string format of the string literal, decoding the unicode escapes to produce the utf-8 string Тимати
. At this point it's as if you wrote:
SELECT convert_from('Тимати', 'utf-8')
Because convert_from
wants bytea
input, PostgreSQL implicitly casts the data to bytea
, converting the text string to utf-8 encodied binary because that's your database server's text encoding. Now it's as if you wrote:
SELECT convert_from( convert_to('Тимати','utf-8'), 'utf-8')
which when the first conversion is done gets turned into:
SELECT convert_from(BYTEA '\xd0a2d0b8d0bcd0b0d182d0b8', 'utf-8');
So effectively your convert_from
is a very slow and inefficient way to do nothing at all.
It doesn't work that way for a column value instead of a literal, because PostgreSQL will implicitly cast unknown-typed literals in ways it won't implicitly cast known-type column values like varchar
column. That's a type-safety thing.
So, to correctly convert your data, you need to decode those \u
escapes. convert_from
is not what you want for that, because it's designed for converting binary representations of encoded text into the local database text encoding.
PostgreSQL supports that for string literals, as we've determined by seeing what it's doing for E''
literals. I'm currently looking for a function that exposes that decoding to the user, though. decode(...)
doesn't support \u
escapes, only octal escapes...