3

I have UTF-16 hex representation such as “0633064406270645” which is "سلام" in Arabic language.

I would like to convert it to its text equivalent. Is there a straight way to do that in PostgreSQL?

I can convert the UTF code point like below; unfortunately it seems UTF16 is not supported. Any ideas on how to do it in PostgreSQL, worst case I will write a function?

SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8');

"سلام"

SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16');

ERROR:  invalid source encoding name "UTF16"
********** Error **********
Adam
  • 515
  • 1
  • 5
  • 15
  • Thank you guys, both techniques works fine. A function seems to be more accurate than using Unicode escaped sequences. For my application, accuracy is not required, so both techniques would do. – Adam Oct 28 '14 at 15:02

3 Answers3

5

PostgreSQL does not support UTF-16 natively. I suggest you to convert your data to UTF-8 before supplying it to the DB. If it's too late (wrong data already exists in your DB), you can use these maintenance functions to convert data from UTF-16 (logic copied from wikipedia):

-- convert from bytea, containing UTF-16-BE data
CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
WITH source(unit) AS (
  SELECT (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1)
  FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
  SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
  FROM source
)
SELECT string_agg(CASE
  WHEN unit >= 56320 AND unit <= 57343 THEN CASE
    WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
    ELSE invalid_replacement
  END
  WHEN unit >= 55296 AND unit <= 56319 THEN CASE
    WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
    ELSE invalid_replacement
  END
  ELSE chr(unit)
END, '')
FROM codes
$function$;

-- convert from bytea, containing UTF-16-LE data
CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
WITH source(unit) AS (
  SELECT get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8)
  FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
  SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
  FROM source
)
SELECT string_agg(CASE
  WHEN unit >= 56320 AND unit <= 57343 THEN CASE
    WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
    ELSE invalid_replacement
  END
  WHEN unit >= 55296 AND unit <= 56319 THEN CASE
    WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
    ELSE invalid_replacement
  END
  ELSE chr(unit)
END, '')
FROM codes
$function$;

-- convert from bytea, containing UTF-16 data (with or without BOM)
CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE COALESCE(octet_length(utf16_data), 0)
  WHEN 0 THEN ''
  WHEN 1 THEN invalid_replacement
  ELSE CASE substring(utf16_data FOR 2)
    WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement)
    ELSE convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement)
  END
END
$function$;

With these functions, you can convert from all kind of UTF-16:

SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')),
       convert_from_utf16le(decode('330644062706450652D862DF', 'hex')),
       convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')),
       convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex'));

-- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16
------------------------+----------------------+--------------------+-------------------
-- سلام                | سلام                | سلام              | سلام
pozs
  • 34,608
  • 5
  • 57
  • 63
3

That's right, Postgres doesn't support UTF-16.

However, it does support Unicode escape sequences:

SELECT U&'\0633\0644\0627\0645'

But keep in mind that Unicode code points and UTF-16 code units are only equivalent in the Basic Multilingual Plane. In other words, if you have any UTF-16 characters which span multiple 16-bit code units, you'll need to translate them to the corresponding code point yourself.

Nick Barnes
  • 19,816
  • 3
  • 51
  • 63
3

convert_from (or PostgreSQL in general) doesn't support UTF-16, but you may resort to one of the optional languages that do.

Example in plperlu (requires database superuser privileges to create the function, and CREATE LANGUAGE plperlu if not created already):

CREATE FUNCTION decode_utf16(text) RETURNS text AS $$
  require Encode;
  return Encode::decode("UTF-16BE", pack("H*", $_[0]));
$$ immutable language plperlu;

=> select decode_utf16('0633064406270645');

 decode_utf16 
--------------
 سلام
Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156