Your example converted by the Punycode converter results in this UTF-8 string:
www.аррӏе.com
$ perl -e 'printf("%02x ", ord) for split("", "www.аррӏе.com"); print "\n"'
77 77 77 2e d0 b0 d1 80 d1 80 d3 8f d0 b5 2e 63 6f 6d
As Unicode code points:
$ perl -Mutf8 -e 'printf("%04x ", ord) for split("", "www.аррӏе.com"); print "\n"'
0077 0077 0077 002e 0430 0440 0440 04cf 0435 002e 0063 006f 006d
Using @ikegami's input:
$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\p{Cyrillic}/g); print "\n"'
аррӏе
$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\P{Cyrillic}/g); print "\n"'
www..com
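The two one-liners above can be folded into a quick yes/no check. This is only a minimal sketch: the helper name mixes_latin_and_cyrillic is made up for illustration, it looks at just these two scripts, and the test string is written with \x{...} escapes so it does not depend on copy & paste preserving the Cyrillic characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: true if the string contains at least one Latin
# and at least one Cyrillic character.
sub mixes_latin_and_cyrillic {
    my ($text) = @_;
    return ($text =~ /\p{Latin}/ && $text =~ /\p{Cyrillic}/) ? 1 : 0;
}

# The spoofed domain from the question, spelled with code point escapes
my $spoofed = "www.\x{0430}\x{0440}\x{0440}\x{04CF}\x{0435}.com";

for my $domain ($spoofed, 'www.apple.com') {
    printf "%s: %s\n", $domain,
        mixes_latin_and_cyrillic($domain) ? 'mixes Latin and Cyrillic' : 'no mix';
}
```

Note that this looks at the whole domain, so the spoofed example is only caught because "www" and "com" are Latin.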
Original idea
I'm not sure whether code for this already exists, but my first idea would be to create a map from each Unicode code point (\N{U+xxxx}) to its visually equivalent ASCII character. You could then apply the map to the Unicode string to "convert" it to ASCII and compare the resulting string with a list of known domains.
Example code (I'm skipping the IDN decoding and using the already decoded UTF-8 result directly in the test data). It could probably still be improved, but at least it shows the idea.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
# Unicode (in HEX) -> visually equal ASCII/ISO-8859-1/... character
my %unicode_to_equivalent = (
'0430' => 'a',
'0435' => 'e',
'04CF' => 'l',
'0440' => 'p',
);
while (<DATA>) {
    chomp;

    # assuming that this returns a valid Perl UTF-8 string
    #my $IDN = domain_to_unicode($_);
    my ($IDN, $compare) = split(' ', $_); # already decoded in test data

    my $visually_decoded =
        join('',                 # merge result
            map {                # map, if mapping exists
                $unicode_to_equivalent{sprintf("%04X", ord($_))} // $_
            }
            split('', $IDN)      # split to characters
        );

    print "Testing: ", encode('UTF-8', $IDN), " -> $compare ";
    print "Visual match!"
        if ($visually_decoded eq $compare);
    print "\n";
}
exit 0;
__DATA__
www.аррӏе.com www.apple.com
Test run (the result depends on whether copy & paste from the answer preserves the original UTF-8 strings)
$ perl dummy.pl
Testing: www.аррӏе.com -> www.apple.com Visual match!
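For a small, fixed table the same mapping can also be written with tr///. This is a minimal sketch, not a complete confusables table: the helper name visual_skeleton is hypothetical, and it only covers the four Cyrillic letters from the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace known look-alike characters with their
# ASCII counterparts and return the result.
sub visual_skeleton {
    my ($text) = @_;
    # U+0430 -> a, U+0435 -> e, U+04CF -> l, U+0440 -> p
    (my $skeleton = $text) =~ tr/\x{0430}\x{0435}\x{04CF}\x{0440}/aelp/;
    return $skeleton;
}

# The spoofed domain, written with escapes so copy & paste cannot alter it
my $idn = "www.\x{0430}\x{0440}\x{0440}\x{04CF}\x{0435}.com";

print visual_skeleton($idn) eq 'www.apple.com'
    ? "Visual match!\n"
    : "No match\n";
```

Characters that are not in the table pass through unchanged, which mirrors the // $_ fallback in the map-based version.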
Counting the number of scripts in the string
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use Unicode::UCD qw(charscript);
while (<DATA>) {
    chomp;

    # assuming that this returns a valid Perl UTF-8 string
    #my $IDN = domain_to_unicode($_);
    my ($IDN) = $_; # already decoded in test data

    # Unicode characters
    my @characters = split('', $IDN);

    # See UTS #39: Unicode Security Mechanisms
    my %scripts =
        map { (charscript(ord), 1) } # code point to script
        @characters;
    # Common and Inherited characters (e.g. '.', digits, combining marks) fit any script
    delete @scripts{qw(Common Inherited)};

    print 'Testing: ',
        encode('UTF-8', $IDN),
        ' (', join(' ', map { sprintf("%04X", ord) } @characters), ')',
        (keys %scripts == 1) ? ' not' : '', " suspicious\n";
}
exit 0;
__DATA__
www.аррӏе.com
www.apple.com
www.école.fr
Test run (the result depends on whether copy & paste from the answer preserves the original UTF-8 strings)
$ perl dummy.pl
Testing: www.аррӏе.com (0077 0077 0077 002E 0430 0440 0440 04CF 0435 002E 0063 006F 006D) suspicious
Testing: www.apple.com (0077 0077 0077 002E 0061 0070 0070 006C 0065 002E 0063 006F 006D) not suspicious
Testing: www.école.fr (0077 0077 0077 002E 00E9 0063 006F 006C 0065 002E 0066 0072) not suspicious
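To see why a name is flagged, it can help to report the script names rather than just their count. A minimal sketch under the same assumptions as the script above: scripts_of is a made-up helper name, code points that charscript cannot classify are reported under the placeholder "Unknown", and Common and Inherited are skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: sorted list of the scripts used in a string,
# ignoring Common and Inherited characters.
sub scripts_of {
    my ($text) = @_;
    my %seen;
    for my $char (split //, $text) {
        # charscript() returns undef for code points it cannot classify
        my $script = charscript(ord $char) // 'Unknown';
        next if $script eq 'Common' || $script eq 'Inherited';
        $seen{$script} = 1;
    }
    my @scripts = sort keys %seen;
    return @scripts;
}

# The spoofed domain, written with escapes so copy & paste cannot alter it
my $idn = "www.\x{0430}\x{0440}\x{0440}\x{04CF}\x{0435}.com";

my @scripts = scripts_of($idn);
print 'Scripts found: ', join(', ', @scripts), "\n";
print @scripts == 1 ? "not suspicious\n" : "suspicious\n";
```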