I am working on a Perl code base that validates customer input, and my goal is to block surrogate characters.

My idea is to first encode the customer input as UTF-16 and then check each character against the surrogate ranges:

 foreach my $messageChar (@MessageChars) {
   my $messageCharUTF16 = Encode::encode("UTF-16", $messageChar);
   if (($messageCharUTF16 >= 0xD800 && $messageCharUTF16 <= 0xDBFF)|( $messageCharUTF16 >= 0xDC00 && $messageCharUTF16 <= 0xDFFF)) {
      # Then we have surrogate pairs
   }   
 }

However, I am not getting the correct UTF-16 values from Encode::encode.

How can I reveal the surrogate pairs? Is there a straightforward way to check whether a string contains surrogate characters in Perl?

Dengke Liu

1 Answer

It's not clear to me what you want to check, so I shall cover both possibilities.


To check if a decoded string contains any of U+D800..U+DFFF

The Unicode Standard says that no UTF form, including UTF-16, can encode these code points, and Perl obliges.

$ perl -e'use open ":std", ":encoding(UTF-8)"; print "ABC\N{U+D800}DEF\n";'
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
"\x{d800}" does not map to utf8 at -e line 1.
ABC\x{D800}DEF

To check for those characters, you can use

$str =~ /[\x{D800}-\x{DFFF}]/
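
For input validation, that match could be wrapped in a small helper. This is only a sketch: `has_surrogate` is a hypothetical name, and it assumes the string has already been decoded into characters.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the decoded string contains any
# code point in the surrogate range U+D800..U+DFFF, otherwise 0.
sub has_surrogate {
    my ($str) = @_;
    return $str =~ /[\x{D800}-\x{DFFF}]/ ? 1 : 0;
}

# chr() is used to build the test input so the surrogate is explicit.
my $clean = "ABCDEF";
my $bad   = "ABC" . chr(0xD800) . "DEF";

print has_surrogate($clean) ? "rejected\n" : "accepted\n";
print has_surrogate($bad)   ? "rejected\n" : "accepted\n";
```

The helper only reports whether a surrogate is present. Whether to reject the input or strip the character is left to the caller.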

To check for any encoding error, you can use

eval { encode("UTF-8", $str, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 }
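
A self-contained version of that check might look like the sketch below. The name `is_encodable_utf8` is hypothetical. The `encode` call is the same as above, with the `use Encode` line it needs.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns 1 if $str can be encoded as strict UTF-8,
# otherwise 0.
sub is_encodable_utf8 {
    my ($str) = @_;
    # FB_CROAK makes encode() die on a character it cannot map;
    # LEAVE_SRC stops encode() from modifying $str.
    my $ok = eval {
        encode("UTF-8", $str, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

print is_encodable_utf8("ABC")                 ? "ok\n" : "encoding error\n";
print is_encodable_utf8("ABC" . chr(0xD800))   ? "ok\n" : "encoding error\n";
```

Unlike the regex, this catches any character the strict encoder refuses, not only surrogates.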

To check if a decoded string contains a character above U+FFFF

Characters above U+FFFF can't be encoded using UCS-2, and require a surrogate pair to encode using UTF-16.

$ perl -e'use open ":std", ":encoding(UTF-16le)"; print "\N{U+10000}";' | od -t x2
0000000 d800 dc00
0000004
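
To see where `d800 dc00` comes from, the pair can be computed by hand and compared with what `Encode` produces. This is an illustrative sketch, and `surrogate_pair` and `utf16_units` are hypothetical names. Note that `Encode::encode` returns a string of bytes, not a number, so the 16-bit code units have to be unpacked before they can be compared numerically.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Compute the UTF-16 surrogate pair for a code point above U+FFFF.
sub surrogate_pair {
    my ($cp) = @_;
    my $offset = $cp - 0x10000;                # 20-bit value
    my $high   = 0xD800 + ($offset >> 10);     # top 10 bits
    my $low    = 0xDC00 + ($offset & 0x3FF);   # bottom 10 bits
    return ($high, $low);
}

# Encode a decoded string as big-endian UTF-16 and return its
# 16-bit code units as a list of numbers.
sub utf16_units {
    my ($str) = @_;
    return unpack("n*", encode("UTF-16BE", $str));
}

printf "%04x %04x\n", surrogate_pair(0x10000);     # prints "d800 dc00"
printf "%04x %04x\n", utf16_units(chr(0x10000));   # prints "d800 dc00"
```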

To check for those characters, you can use

$str =~ /[^\0-\x{FFFF}]/
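
As with the first check, this could be wrapped in a helper for validation code. A minimal sketch, where `has_non_bmp` is a hypothetical name and the input is assumed to be a decoded string:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the decoded string contains any
# character above U+FFFF (one that needs a surrogate pair in UTF-16),
# otherwise 0.
sub has_non_bmp {
    my ($str) = @_;
    return $str =~ /[^\0-\x{FFFF}]/ ? 1 : 0;
}

print has_non_bmp("ABC")                ? "needs surrogates\n" : "fits in UCS-2\n";
print has_non_bmp("A" . chr(0x1F600))   ? "needs surrogates\n" : "fits in UCS-2\n";
```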
ikegami