
With bash:

$ echo '\u043f\u0441\u0438\u0445\u043e\u0442\u0435\u0440\u0430\u043f\u0438\u044f.net' | ascii2uni -a U -q

психотерапия.net

How can I do this in Perl?

use utf8;
use URI::_punycode (decode_punycode,encode_punycode);

$fqdn = "\u043f\u0441\u0438\u0445\u043e\u0442\u0435\u0440\u0430\u043f\u0438\u044f.net";

$fqdn = `echo $fqdn | ascii2uni -a U -q`;
$unicode  = encode_punycode($fqdn);
print "$unicode\n";

returns:

$ perl test.pl

043f044104380445043e0442043504400430043f0438044f.net

Alfons
  • Your code does not compile. Please edit your question so that it contains code that actually compiles. Most likely you stripped a `qw` from the URI::_punycode import: `use URI::_punycode qw(encode_punycode decode_punycode);`. – Corion Dec 19 '18 at 11:43
  • `\u` in Perl does not mean "Unicode escape". Perl uses `\N{U+43f}`. What is the actual input you have? Is it the literal string `\u043f...` or is it a byte sequence containing UTF-8-encoded character 043f ? – Corion Dec 19 '18 at 11:50
  • My input is an FQDN (probably with Chinese characters). I fetch it from a web page with LWP, and it shows up in "\u" format in the shell. – Alfons Dec 19 '18 at 12:50
  • Is the web page returning JSON? – Grinnz Dec 19 '18 at 18:05

2 Answers


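\u in Perl does not mean "Unicode escape". Inside a double-quoted string it uppercases the next character, which is why your output contains only the hex digits. Perl's syntax for a code point is \N{U+43f}. With your program changed to use that syntax, it works for me: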

#!perl
use strict;
use warnings;
use utf8;
use URI::_punycode qw(decode_punycode encode_punycode);

binmode STDOUT, ':encoding(UTF-8)';

my $fqdn = "\N{U+043f}\N{U+0441}\N{U+0438}\N{U+0445}\N{U+043e}\N{U+0442}\N{U+0435}\N{U+0440}\N{U+0430}\N{U+043f}\N{U+0438}\N{U+044f}.net";
print "FQDN: [$fqdn]\n";

print "\n---\n";
my $punycode  = encode_punycode($fqdn);
print "\n---\n";
print "[$punycode]\n";

This outputs the following for me, which I assume is the intended result:

FQDN: [психотерапия.net]

---

---
[.net-43d3auc5ciekjq7byl]
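
Note that the call above passes the whole name, .net included, to encode_punycode as a single label. For a name meant for DNS, the usual IDNA convention is to encode each non-ASCII label separately and prefix it with xn--. A rough sketch of that, assuming URI::_punycode is installed and skipping the normalization and validation steps that full IDNA processing performs; to_ace is a hypothetical helper name, not part of any module:

```perl
#!perl
use strict;
use warnings;
use URI::_punycode qw(encode_punycode);

# Hypothetical helper: convert each non-ASCII label to its xn-- form.
# The -1 limit keeps a trailing empty label (a name ending in ".").
sub to_ace {
    my ($fqdn) = @_;
    return join '.', map {
        /[^\x00-\x7f]/ ? 'xn--' . encode_punycode($_) : $_
    } split /\./, $fqdn, -1;
}

# "b\x{fc}cher" is "bücher"; ASCII-only labels such as "de" pass through unchanged.
print to_ace("b\x{fc}cher.de"), "\n";   # prints xn--bcher-kva.de
```

Only the labels that contain non-ASCII characters are touched, so a plain ASCII name comes back unchanged.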

If you have the FQDN literally as a string like \uabcd\u1234..., you can convert it to Unicode using:

$fqdn =~ s/\\u([[:xdigit:]]{4})/chr(hex($1))/ge;

For further details see the other answer.

See also

Quote and Quote-like operators in Perl for the string escapes

Corion
  • My input is an FQDN (probably with Chinese characters). I fetch it from a web page with LWP, and it shows up in "\u" format in the shell. – Alfons Dec 19 '18 at 12:50

\uXXXX has nothing to do with Punycode/IDN. It looks like the JSON string escape format for Unicode characters, so you need the right tools to handle it.

First, you have to either escape the backslashes inside double quotes or use single quotes.
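
A minimal sketch of why the quoting matters, using only core Perl. In a double-quoted string Perl treats \u as "uppercase the next character", so the backslash and the u are gone before any conversion can see them. The variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Double quotes: \u means "uppercase the next character", so "\u043f"
# collapses to the four characters 043f.
my $double  = "\u043f";

# Single quotes, or an escaped backslash, keep the six literal characters \u043f.
my $single  = '\u043f';
my $escaped = "\\u043f";

printf "%-8s %-7s length %d\n", 'double',  $double,  length $double;
printf "%-8s %-7s length %d\n", 'single',  $single,  length $single;
printf "%-8s %-7s length %d\n", 'escaped', $escaped, length $escaped;
```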

If you don't need to deal with surrogate pairs, you can simply convert the hex numbers to Unicode characters.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;

my $fqdn = '\u043f\u0441\u0438\u0445\u043e\u0442\u0435\u0440\u0430\u043f\u0438\u044f.net';
$fqdn =~ s/\\u([[:xdigit:]]{4})/chr(hex($1))/ge;

print encode_utf8 $fqdn;
print "\n";

If you do have to handle surrogate pairs, you can still do the conversion without non-core CPAN modules.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;

my $fqdn = '\u043f\u0441\u0438\u0445\u043e\u0442\u0435\u0440\u0430\u043f\u0438\u044f.net';

my $re_hex = qr/[[:xdigit:]]{4}/;
my $re_uni = qr/\\u$re_hex/;
my $re_uni_capture = qr/\\u($re_hex)/;

$fqdn = join q{}, map {
    /^$re_uni/
        ? decode 'utf-16-be', pack "n*", map { hex } m/$re_uni_capture/g
        : $_
} split qr/(${re_uni}*)/, $fqdn;

print encode_utf8 $fqdn;
print "\n";
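
If the \uXXXX text really is the inside of a JSON string, as suspected at the top of this answer, a JSON decoder can do the whole conversion, surrogate pairs included. The sketch below uses the core module JSON::PP. Wrapping the input in double quotes to form a JSON string literal is an assumption that only holds if the input contains no unescaped quotes or control characters, and json_unescape is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
use Encode qw(encode_utf8);

# allow_nonref lets the decoder accept a bare JSON string instead of
# requiring an object or array at the top level.
my $json = JSON::PP->new->allow_nonref;

# Hypothetical helper: treat the input as the body of a JSON string literal.
sub json_unescape {
    my ($escaped) = @_;
    return $json->decode(qq{"$escaped"});
}

my $fqdn = json_unescape('\u043f\u0441\u0438\u0445\u043e\u0442\u0435\u0440\u0430\u043f\u0438\u044f.net');
print encode_utf8($fqdn), "\n";

# A surrogate pair (\ud83d\ude00) decodes to the single code point U+1F600.
my $astral = json_unescape('\ud83d\ude00');
printf "U+%X\n", ord $astral;
```

If the web page returns a complete JSON document, decoding the whole response body and reading the field from the resulting data structure is likely simpler than unescaping a fragment.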


ernix
  • My input is an FQDN (probably with Chinese characters). I fetch it from a web page with LWP, and it shows up in "\u" format in the shell. – Alfons Dec 19 '18 at 12:50
  • Could you give us the actual HTTP response? – ernix Dec 19 '18 at 12:54
  • I have three example FQDNs: شهردانایی.net 湖南九洲国际旅行社.net hydrá2wèb.net – Alfons Dec 19 '18 at 12:55
  • But those are all plain Unicode characters; none of them contains the "\uXXXX" form described in RFC 5137. https://tools.ietf.org/html/rfc5137 – ernix Dec 19 '18 at 13:04