How to use Perl pack to convert UTF-16 surrogate pairs to UTF-8?

Question

I have input strings in which some characters are represented as UTF-16 code units escaped with '\u'. I am trying to convert these strings to UTF-8 in Perl. For example, the string 'Alice & Bob & Carol' might be formatted in the input as:

'Alice \u0026 Bob \u0026 Carol'

To do my desired conversion, I was doing...:

$str =~ s/\\u([A-Fa-f0-9]{4})/pack("U", hex($1))/eg;

...which worked fine until I got to input strings that contained UTF-16 surrogate pairs like:

'Alice \ud83d\ude06 Bob'

How do I modify the above code that uses pack to work with UTF-16 surrogate pairs? I would really like a solution that just uses pack without having to use any additional libraries (JSON::XS, Encode, etc.).

What do you mean by UTF-16? There's no way that code would work with UTF-16 input. — ikegami, Jun 09 '22 at 00:46
Assuming you don't have UTF-16 but decoded text, you'd use a substitution that looks for a hi surro followed by a lo. Use `ord`/`unpack W` to get the numbers, do some bit twiddling, then use `chr` / `pack W` to create the Code Point. — ikegami, Jun 09 '22 at 00:47
Those sure look like JSON strings. It seems likely to me that you will encounter other issues with them being JSON-encoded besides the \u's. Just use a JSON decoder and don't make easy things hard for yourself. — ysth, Jun 09 '22 at 03:09
@ikegami: In the input, all the non-ASCII characters started out UTF-16 but have already been escaped with the '\u' escape sequence. — WingedKnight, Jun 09 '22 at 03:49
It doesn't matter what encoding it used to be before it was transformed to ASCII? UTF-8? — ikegami, Jun 09 '22 at 04:34

Shawn · Accepted Answer · 2022-06-09T02:05:01.690

3

pack/unpack have no knowledge of UTF-16 text, just UTF-8 (and UTF-EBCDIC). You have to decode the surrogate pairs manually since you don't want to use a module.

#!/usr/bin/env perl
use strict;
use warnings;
use open qw/:locale/;
use feature qw/say/;

my $str = 'Alice \ud83d\ude06 Bob \u0026 Carol';

# Convert surrogate pairs encoded as two \uXXXX sequences
# Only match valid surrogate pairs so adjacent non-pairs aren't counted as one
$str =~ s/\\u((?i)D[89AB]\p{AHex}{2}) # High surrogate in range 0xD800–0xDBFF
          \\u((?i)D[CDEF]\p{AHex}{2}) #  Low surrogate in range 0xDC00–0xDFFF
         /chr( ((hex($1) - 0xD800) * 0x400) + (hex($2) - 0xDC00) + 0x10000 )/xge;
# Convert single \uXXXX sequences
$str =~ s/\\u(\p{AHex}{4})/chr hex $1/ge;

say $str;

outputs

Alice 😆 Bob & Carol
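To make the arithmetic in the first substitution concrete, here is a minimal sketch that works through the `\ud83d\ude06` pair by hand. The helper name `combine_surrogates` is hypothetical (not part of the answer above), and the built-in `utf8::encode` is used only to show the resulting UTF-8 bytes; it needs no module to be loaded.

```perl
use strict;
use warnings;

# Hypothetical helper: combine a high and a low surrogate into one code point.
# Each surrogate carries 10 bits; the pair encodes (code point - 0x10000).
sub combine_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my $cp = combine_surrogates(0xD83D, 0xDE06);
printf "U+%04X\n", $cp;    # prints U+1F606

# Turn the single character into its UTF-8 byte sequence (in place).
my $bytes = chr $cp;
utf8::encode($bytes);
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes), "\n";    # prints F0 9F 98 86
```

The shift by 10 is the same operation as the multiplication by 0x400 in the answer's code; both move the high surrogate's 10 payload bits above the low surrogate's.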

edited Jun 09 '22 at 02:05

answered Jun 09 '22 at 01:48

Shawn


I'd do it the other way around. Decode the `\u`, then look for `[\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}]`. Works for more inputs that way. – ikegami Jun 09 '22 at 04:35
Too many warnings about invalid characters that way. – Shawn Jun 09 '22 at 04:38
Re "*Too many warnings about invalid characters that way.*", You probably didn't make the necessary accompanying change from `hex($x)` to `ord($x)`. – ikegami Jun 09 '22 at 13:44
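The decode-first order described in the comments above can be sketched roughly as follows. This is an illustration under assumptions, not ikegami's own code: it assumes Perl 5.14 or later (where the `surrogate` warnings category exists), and the scoped `no warnings` is a precaution that may be unnecessary on a given Perl.

```perl
use strict;
use warnings;

my $str = 'Alice \ud83d\ude06 Bob \u0026 Carol';

# Step 1: decode every \uXXXX escape. Surrogates temporarily become
# lone characters in the string.
$str =~ s/\\u([0-9A-Fa-f]{4})/chr hex $1/ge;

# Step 2: merge each high+low surrogate pair into one code point.
# Note ord() here, not hex(): the captures are now characters, not hex digits.
{
    no warnings 'surrogate';    # assumption: precautionary, requires Perl 5.14+
    $str =~ s/([\x{D800}-\x{DBFF}])([\x{DC00}-\x{DFFF}])
             /chr( 0x10000 + ((ord($1) - 0xD800) << 10) + (ord($2) - 0xDC00) )/xge;
}

# Step 3: encode the character string as UTF-8 bytes (built-in, no module).
utf8::encode($str);
print $str, "\n";
```

Because step 2 operates on characters rather than on escape text, it also combines surrogates that arrived unescaped in the input, which is presumably what "works for more inputs" refers to.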
The solution doesn't work for `{ "a": "\\u2660" }` – ikegami Jun 09 '22 at 13:44
@ikegami: If you believe you have a better solution, then can you post code demonstrating your solution as an answer? Thanks. – WingedKnight Jun 10 '22 at 04:05
@WingedKnight, No, I have no intention of writing a JSON parser for you. Perfectly good ones already exist. It would be silly for me to do this. – ikegami Jun 10 '22 at 05:06
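Short of a full JSON parser, the `\\u2660` case raised in the comments can at least be handled by scanning left to right in a single substitution, so that an escaped backslash is consumed before the `u` after it is considered. The sketch below is a hedged illustration: the name `decode_u_escapes` is hypothetical, escaped backslashes are deliberately left untouched, and no other JSON escapes (`\n`, `\"`, and so on) are processed.

```perl
use strict;
use warnings;

# Hypothetical helper: decode \uXXXX escapes (including surrogate pairs)
# while leaving an escaped backslash (two backslashes) exactly as it is.
sub decode_u_escapes {
    my ($s) = @_;
    $s =~ s{
        (\\\\)                               # group 1: escaped backslash, kept
      | \\u([Dd][89ABab][0-9A-Fa-f]{2})      # group 2: high surrogate
        \\u([Dd][C-Fc-f][0-9A-Fa-f]{2})      # group 3: low surrogate
      | \\u([0-9A-Fa-f]{4})                  # group 4: any other single escape
    }{
        defined $1 ? $1
      : defined $2 ? chr(0x10000 + ((hex($2) - 0xD800) << 10) + (hex($3) - 0xDC00))
      :              chr hex $4
    }xge;
    return $s;
}

# Two backslashes followed by u2660 is literal text, so it stays unchanged.
print decode_u_escapes('{ "a": "\\\\u2660" }'), "\n";
```

The result is a character string; it still has to be encoded (for example with `utf8::encode` or an output layer) before it is written out as UTF-8.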
