4

When i'm trying to convert from unicode to utf8 in the code below "function convert_from(character varying, unknown) does not exist" error occurs.

select convert_from(artists, 'UTF8') from songs where     
to_tsvector('simple',convert_from(artists, 'UTF8')) 
  @@ plainto_tsquery('simple','alizee') 
limit 100

Column "artists" has "TEXT" type.

But when I'm running

select convert_from(E'\u0422\u0438\u043c\u0430\u0442\u0438', 'UTF8');

it works well.

How can I resolve this problem? I would appreciate any help. Thanks

Andrew Lazarus
  • 18,205
  • 3
  • 35
  • 53
Mega4alik
  • 597
  • 3
  • 10
  • 18
  • Please show a sample of a value in the `artists` column. You can't store "unicode" (assuming you mean utf-16/utf-32/ucs-2/ucs-4) in a PostgreSQL `text` field, because it contains null bytes, which in a `text` field terminate the string. So whatever's going on, there isn't enough information here to guess. **Show the data** by editing the question, then comment here when you are done. – Craig Ringer May 15 '14 at 01:51
  • sample value of artists column: "brig\u0442\u0438ada" – Mega4alik May 15 '14 at 05:36
  • aah, so it's actually *text* with unicode character escapes. That's very different. – Craig Ringer May 15 '14 at 07:52
  • Possible duplicate of http://stackoverflow.com/q/20124393/398670, http://stackoverflow.com/q/10111654/398670 – Craig Ringer May 15 '14 at 08:22

2 Answers2

5

From documentation: convert_from(string bytea, src_encoding name). So cast artists to bytea:

select convert_from(artists::bytea, 'UTF8') from songs where     
to_tsvector('simple',convert_from(artists, 'UTF8')) 
  @@ plainto_tsquery('simple','alizee') 
limit 100
klin
  • 112,967
  • 15
  • 204
  • 232
  • 2
    select convert_from(artists::bytea, 'UTF8') from songs where to_tsvector('simple',convert_from(artists::bytea, 'UTF8')) @@ plainto_tsquery('simple','alizee') limit 100. Another error occured: "ERROR: invalid input syntax for type bytea" – Mega4alik May 15 '14 at 00:39
4

It looks to me like you've taken data in the form of 7-bit ascii with 2-byte unicode hex escapes (\uxxxx) and stored it in a varchar typed field.

convert_from is entirely the wrong way to handle that data. It isn't utf-8 except in that 7-bit ASCII is a subset of utf-8. If you forced it through convert_from(..., 'utf-8') you'd get exactly what you started with, blah\u0123 etc.

The reason that select convert_from(E'\u0422\u0438\u043c\u0430\u0442\u0438', 'UTF8'); appears to work is that the convert_from does nothing. Here's what happens:

  • PostgreSQL sees the literal value E'\u0422\u0438\u043c\u0430\u0442\u0438' and sees that convert_from wants bytea input.

  • PostgreSQL parses the escape-string format of the string literal, decoding the unicode escapes to produce the utf-8 string Тимати. At this point it's as if you wrote:

    SELECT convert_from('Тимати', 'utf-8')
    
  • Because convert_from wants bytea input, PostgreSQL implicitly casts the data to bytea, converting the text string to utf-8 encodied binary because that's your database server's text encoding. Now it's as if you wrote:

    SELECT convert_from( convert_to('Тимати','utf-8'), 'utf-8')
    

    which when the first conversion is done gets turned into:

    SELECT convert_from(BYTEA '\xd0a2d0b8d0bcd0b0d182d0b8', 'utf-8');
    

So effectively your convert_from is a very slow and inefficient way to do nothing at all.

It doesn't work that way for a column value instead of a literal, because PostgreSQL will implicitly cast unknown-typed literals in ways it won't implicitly cast known-type column values like varchar column. That's a type-safety thing.

So, to correctly convert your data, you need to decode those \u escapes. convert_from is not what you want for that, because it's designed for converting binary representations of encoded text into the local database text encoding.

PostgreSQL supports that for string literals, as we've determined by seeing what it's doing for E'' literals. I'm currently looking for a function that exposes that decoding to the user, though. decode(...) doesn't support \u escapes, only octal escapes...

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778