Perl: Homograph attacks. Is it possible to compare visually similar ASCII and non-ASCII strings?

Question

I ran into a so-called "homograph attack" and want to reject domains whose decoded Punycode looks like plain alphanumeric text. For example, www.xn--80ak6aa92e.com is displayed as www.apple.com in the browser (Firefox). The two domains look identical, but they use different character sets. Chrome has already patched this and displays the Punycode form instead.

Here is an example:

#!/usr/bin/perl

use strict;
use warnings;

use Net::IDN::Encode ':all';
use utf8;                             


my $testdomain = "www.xn--80ak6aa92e.com";
my $IDN = domain_to_unicode($testdomain);
my $visual_result_ascii = "www.apple.com";

print "S1: $IDN\n";
print "S2: $visual_result_ascii\n";
print "MATCH\n" if ($IDN eq $visual_result_ascii);

They look the same, but they don't match. Is it possible to compare a Unicode string ($IDN) against a visually identical alphanumeric string?

It would help to add your output. Especially helpful would probably also be to see the Unicode sequence of the domain name. — Stefan Becker, Feb 08 '19 at 19:09
I'm not sure if code for this exists, but my first idea would be to create a map \Uxxxx -> "visual equivalent ASCII/UTF-8 code". Then you could apply the map on the Unicode string to "convert" it to ASCII/UTF-8 code and compare the resulting string with a list of domains. — Stefan Becker, Feb 08 '19 at 19:12
./testx.pl IDN: www.аррӏе.com Visually the output is similar. — Claude, Feb 08 '19 at 19:13
The authority on this is the section of UTR#36: Unicode Security Considerations on [Visual Security Issues](http://www.unicode.org/reports/tr36/tr36-8.html#visual_spoofing). — ikegami, Feb 08 '19 at 23:28

Stefan Becker · Answer 1 · 2019-02-08T22:37:07.503

Your example converted by the Punycode converter results in this UTF-8 string:

www.аррӏе.com

$ perl -e 'printf("%02x ", ord) for split("", "www.аррӏе.com"); print "\n"'
77 77 77 2e d0 b0 d1 80 d1 80 d3 8f d0 b5 2e 63 6f 6d

As Unicode code points:

$ perl -Mutf8 -e 'printf("%04x ", ord) for split("", "www.аррӏе.com"); print "\n"'
0077 0077 0077 002e 0430 0440 0440 04cf 0435 002e 0063 006f 006d
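To see which characters hide behind those code points, the core charnames module can translate each one into its official Unicode name. This is only a minimal sketch: describe_code_points is a hypothetical helper name, and the label is written with \x{...} escapes so that copy and paste cannot alter it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();   # core module; viacode() maps a code point to its name

# Hypothetical helper: returns one "U+XXXX NAME" string per character.
sub describe_code_points {
    my ($string) = @_;
    return map { sprintf("U+%04X %s", ord($_), charnames::viacode(ord($_)) // '(unnamed)') }
           split //, $string;
}

# The five Cyrillic characters of the decoded second-level label.
print "$_\n" for describe_code_points("\x{0430}\x{0440}\x{0440}\x{04cf}\x{0435}");
```

Every line of the output should start with CYRILLIC SMALL LETTER, which makes it obvious that the label is not Latin at all.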

Using @ikegami's input:

$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\p{Cyrillic}/g); print "\n"'
аррӏе
$ perl -Mutf8 -MEncode -e 'print encode("UTF-8", $_) for ("www.аррӏе.com" =~ /\P{Cyrillic}/g); print "\n"'
www..com

Original idea

I'm not sure if code for this exists, but my first idea would be to create a map \N{xxxx} -> "visual equivalent ASCII/UTF-8 code". Then you could apply the map on the Unicode string to "convert" it to ASCII/UTF-8 code and compare the resulting string with a list of domains.

Example code (I'm skipping the IDN decoding and using the UTF-8 result directly in the test data). This could probably still be improved, but at least it shows the idea.

#!/usr/bin/perl
use strict;
use warnings;

use utf8;
use Encode;

# Unicode (in HEX) -> visually equal ASCII/ISO-8859-1/... character
my %unicode_to_equivalent = (
   '0430' => 'a',
   '0435' => 'e',
   '04CF' => 'l',
   '0440' => 'p',
);

while (<DATA>) {
    chomp;

    # assuming that this returns a valid Perl UTF-8 string
    #my $IDN = domain_to_unicode($_);
    my($IDN, $compare) = split(' ', $_) ; # already decoded in test data

    my $visually_decoded =
        join('',              # merge result
             map {            # map, if mapping exists
                 $unicode_to_equivalent{sprintf("%04X", ord($_))} // $_
             }
             split ('', $IDN) # split to characters
        );

    print "Testing: ", encode('UTF-8', $IDN), " -> $compare ";
    print "Visual match!"
        if ($visually_decoded eq $compare);
    print "\n";
}

exit 0;

__DATA__
www.аррӏе.com www.apple.com

Test run (the result depends on whether copying and pasting from the answer preserves the original UTF-8 strings)

$ perl dummy.pl
Testing: www.аррӏе.com -> www.apple.com Visual match!

Counting the # of scripts in the string

#!/usr/bin/perl
use strict;
use warnings;

use utf8;
use Encode;
use Unicode::UCD qw(charscript);

while (<DATA>) {
    chomp;

    # assuming that this returns a valid Perl UTF-8 string
    #my $IDN = domain_to_unicode($_);
    my($IDN) = $_;  # already decoded in test data

    # Unicode characters
    my @characters = split ('', $IDN);

    # See UTR #39: Unicode Security Mechanisms
    my %scripts =
        map { (charscript(ord), 1) } # Codepoint to script
        @characters;
    delete $scripts{Common};         # punctuation and digits are shared by all scripts

    print 'Testing: ',
        encode('UTF-8', $IDN),
        ' (', join(' ', map { sprintf("%04X", ord) } @characters), ')',
        (keys %scripts == 1) ? ' not' : '', " suspicious\n";
}

exit 0;

__DATA__
www.аррӏе.com
www.apple.com
www.école.fr

Test run (the result depends on whether copying and pasting from the answer preserves the original UTF-8 strings)

$ perl dummy.pl
Testing: www.аррӏе.com (0077 0077 0077 002E 0430 0440 0440 04CF 0435 002E 0063 006F 006D) suspicious
Testing: www.apple.com (0077 0077 0077 002E 0061 0070 0070 006C 0065 002E 0063 006F 006D) not suspicious
Testing: www.école.fr (0077 0077 0077 002E 00E9 0063 006F 006C 0065 002E 0066 0072) not suspicious

Your idea seems good, but how do we get the rest of the Unicode equivalents? We need a-z and 0-9. — Claude, Feb 08 '19 at 19:21
Hmm, I don't know. But my idea was to block all domains that look like an alphanumeric domain. The point of IDN is to support domains with country-specific characters. So if the decoded Punycode looks alphanumeric, that is a sure sign of a scam, and all such domains must be blocked. — Claude, Feb 08 '19 at 19:28
Chrome already has a patch for this. If all characters look alphanumeric, the Punycode domain is not decoded and the xn--... address is displayed instead. — Claude, Feb 08 '19 at 19:30
Following the first "print" example, d0 b0 is visually equivalent to "a", d1 80 to "p", d3 8f to "l", and d0 b5 to "e". I will try to create a filter based on these pairs mapped to their alphanumeric visual equivalents. Maybe I will succeed. Thanks for the idea. — Claude, Feb 08 '19 at 19:52
Finally got some code working, updated my answer. Probably not optimal, but it works. — Stefan Becker, Feb 08 '19 at 20:02
@Claude d0 b0 is the UTF-8 encoding of the character а (Unicode U+0430). For UTF-8 you need to track whether the next character is 1, 2, 3, etc. bytes long. — Stefan Becker, Feb 08 '19 at 20:06
Great!! For me in Europe it's too late now. I will check tomorrow some other "scammy" domains in punycode to get more "visual" equivalents and complete the list. — Claude, Feb 08 '19 at 20:16
There is unicode data you could possibly use for this, which is used by https://unicode.org/cldr/utility/confusables.jsp, but I'm not sure if that data is available to Perl. — Grinnz, Feb 08 '19 at 20:32
Also see [Text::Unidecode](https://metacpan.org/pod/Text::Unidecode), which converts/transliterates Unicode to plain ASCII — Corion, Feb 08 '19 at 20:50

Claude · Accepted Answer · 2019-02-12T12:00:22.370

After some research, and thanks to your comments, I have reached a conclusion. Most of the problems come from Cyrillic. That script contains many characters that look like Latin letters, and they can be combined in many ways.

I have identified some scam IDN domains containing these names:

"аррӏе" "сһаѕе" "сіѕсо"

Maybe with this font you can see a difference here, but in the browser there is absolutely no visual difference.

Consulting https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode I was able to build a table of 12 visually similar characters.

Update: I found 4 more Latin-like characters in the Cyrillic block, 16 in total now.

These can be combined in many ways to create IDNs that look 100% identical to legitimate domains.

0430 a CYRILLIC SMALL LETTER A
0441 c CYRILLIC SMALL LETTER ES
0501 d CYRILLIC SMALL LETTER KOMI DE
0435 e CYRILLIC SMALL LETTER IE
04bb h CYRILLIC SMALL LETTER SHHA 
0456 i CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I 
0458 j CYRILLIC SMALL LETTER JE
043a k CYRILLIC SMALL LETTER KA
04cf l CYRILLIC SMALL LETTER PALOCHKA 
043e o CYRILLIC SMALL LETTER O
0440 p CYRILLIC SMALL LETTER ER
051b q CYRILLIC SMALL LETTER QA 
0455 s CYRILLIC SMALL LETTER DZE
051d w CYRILLIC SMALL LETTER WE 
0445 x CYRILLIC SMALL LETTER HA
0443 y CYRILLIC SMALL LETTER U

The problem occurs in the second-level domain (SLD). Top-level domains can also be IDNs, but they are verified, cannot be spoofed, and are not affected by this issue. The domain registrar checks that all letters come from the same set, and an IDN will not be accepted if it mixes Latin and non-Latin characters. So extra validation for mixed scripts is pointless.

My idea is simple: split the domain, decode only the SLD part, and match it against the list of visually similar Cyrillic characters. If all letters look like Latin letters, the domain is almost certainly a scam.

#!/usr/bin/perl

use strict;
use warnings;

use utf8;
use open ':std', ':encoding(UTF-8)';
use Net::IDN::Encode ':all';
use Array::Utils qw(:all);

my @latinlike_cyrillics = qw (0430 0441 0501 0435 04bb 0456 0458 043a 04cf 043e 0440 051b 0455 051d 0445 0443);

# maybe you can find better examples
my $domain1 = "www.xn--80ak6aa92e.com";
my $domain2 = "www.xn--d1acpjx3f.xn--p1ai";

test_domain ($domain1);
test_domain ($domain2);

sub test_domain {
    my $testdomain = shift;
    # assumes exactly three labels: host, second-level domain, TLD
    my ($tLD, $sLD, $topLD) = split(/\./, $testdomain);
    my $IDN = domain_to_unicode($sLD);

    # hex code point of every character in the decoded SLD
    my @decoded = map { sprintf("%04x", ord) } split("", $IDN);

    # characters that are NOT in the Latin-like list
    my @checker = array_minus( @decoded, @latinlike_cyrillics );
    if (@checker) { print "$testdomain [$IDN] seems to be ok\n" }
    else          { print "$testdomain [$IDN] is possibly scam\n" }
}
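
The same check can be sketched without Array::Utils by turning the 16 code points into a regex character class and testing every label, which also removes the assumption that the domain has exactly three labels. This is only an illustration: has_lookalike_label is a hypothetical helper name, and it expects a domain that has already been decoded to Unicode (for example by domain_to_unicode).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The 16 Latin-like Cyrillic code points from the table, as a character class.
my $latinlike = qr/[\x{0430}\x{0441}\x{0501}\x{0435}\x{04bb}\x{0456}\x{0458}\x{043a}\x{04cf}\x{043e}\x{0440}\x{051b}\x{0455}\x{051d}\x{0445}\x{0443}]/;

# Hypothetical helper: number of labels made up solely of look-alike characters.
sub has_lookalike_label {
    my ($unicode_domain) = @_;
    return scalar grep { /\A$latinlike+\z/ } split /\./, $unicode_domain;
}

# Test data written with escapes so copy & paste cannot alter it.
my $spoofed = "www.\x{0430}\x{0440}\x{0440}\x{04cf}\x{0435}.com";
my $genuine = "www.\x{044f}\x{043d}\x{0434}\x{0435}\x{043a}\x{0441}.\x{0440}\x{0444}";

for my $domain ($spoofed, $genuine) {
    print has_lookalike_label($domain) ? "possibly scam\n" : "seems to be ok\n";
}
```

The first domain should be reported as a possible scam, the second as ok, because its labels contain Cyrillic letters that have no Latin look-alike in the list.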
