I'm using Google Translate to convert some error codes into Farsi with Perl. I've seen the same issue with other target languages, but for this discussion I'll stick to Farsi as the single example:
The translated text of "Geometry data card error" works fine (Example 1), but translating "Appending a default 111 card" (Example 2) gives the "Wide character" warning.
Both examples can be run from the terminal; they are just print statements.
I've tried the usual things like these, but to no avail:
use utf8;
use open ':std', ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
Example 1: This works
perl -Mutf8 -le 'print "\x{d8}\x{ae}\x{d8}\x{b7}\x{d8}\x{a7}\x{db}\x{8c} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d8}\x{af}\x{d8}\x{a7}\x{d8}\x{af}\x{d9}\x{87} \x{d9}\x{87}\x{d9}\x{86}\x{d8}\x{af}\x{d8}\x{b3}\x{db}\x{8c}"'
خطای کارت داده هندسی
Example 2: This produces Wide char warnings and prints noise
perl -Mutf8 -le 'print "\x{d8}\x{a7}\x{d9}\x{81}\x{d8}\x{b2}\x{d9}\x{88}\x{d8}\x{af}\x{d9}\x{86} \x{db}\x{8c}\x{da}\x{a9} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d9}\x{be}\x{db}\x{8c}\x{d8}\x{b4}\x{200c}\x{d9}\x{81}\x{d8}\x{b1}\x{d8}\x{b6} 111"'
Wide character in print at -e line 1.
# <terminal noise, not Farsi text>
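One visible difference between the two examples is that Example 1 contains only escapes below 0x100 (the UTF-8 bytes of the text), while Example 2 mixes those bytes with the code point \x{200c}. The following is a minimal sketch of that difference, not a confirmed diagnosis; the variable names are made up for illustration, and it uses the core Encode module, which the examples above do not:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte string: the UTF-8 encoding of three Farsi letters (every value < 0x100).
my $bytes = "\x{d9}\x{be}\x{db}\x{8c}\x{d8}\x{b4}";

# Mixing bytes with a code point above 0xFF, as Example 2 does:
# Perl now treats each byte as a separate (Latin-1) character.
my $mixed = $bytes . "\x{200c}";

# Character string: decode the bytes first, then append the code point.
my $chars = decode('UTF-8', $bytes) . "\x{200c}";

printf "bytes: %d, chars: %d, mixed: %d\n",
    length $bytes, length $chars, length $mixed;

# Encode explicitly and write raw bytes, so no layer re-encodes the output.
binmode STDOUT, ':raw';
print encode('UTF-8', $chars), "\n";
```

The first line printed is "bytes: 6, chars: 4, mixed: 7". The mixed string has seven characters rather than four, which is consistent with the noise seen in Example 2.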
Using curl
If I make the same request with curl, I get this:
curl 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=auto&tl=fa&hl=fa&dt=t&ie=UTF-8&oe=UTF-8&otf=1&ssel=0&tsel=0&tk=xxxx&dt=dj&q=%41%70%70%65%6E%64%69%6E%67%20%61%20%64%65%66%61%75%6C%74%20%31%31%31%20%63%61%72%64'
[[["افزودن یک کارت پیش\u200cفرض 111","Appending a default 111 card",null,null,3,null,null,[[]],[[["982c75c78c6c8e6005ec3a4021a7f785","tea_GrecoIndoEuropeA_en2elfahykakumksq_2021q3.md"]]]]],null,"en",null,null,null,1,[],[["en"],null,[1],["en"]]]
Notice the \u200c in the JSON output above, which is a "Zero Width Non-Joiner" Unicode character. When JSON::from_json parses the \u200c, printing the result gives the same warning:
perl -Mutf8 -MJSON -e 'print from_json("[\"\\u200c\"]")->[0];'
Wide character in print at -e line 1.
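To check what the parser actually returns, here is a small sketch under stated assumptions: it swaps in the core JSON::PP object interface in place of JSON's from_json so that it runs without extra modules, and it writes to an in-memory filehandle standing in for STDOUT. Both choices are mine, not part of the original request code.

```perl
use strict;
use warnings;
use JSON::PP;

# The JSON text holds the six ASCII characters \u200c inside a string.
my $json_text = '["\u200c"]';

# Decode a character string (no utf8 option set on the parser).
my $text = JSON::PP->new->decode($json_text)->[0];

printf "length=%d code point=U+%04X\n", length $text, ord $text;

# Print through an encoding layer; the in-memory handle stands in for STDOUT.
my $buf = '';
open my $fh, '>:encoding(UTF-8)', \$buf or die "open: $!";
print {$fh} $text;
close $fh;

# Show the bytes that reached the handle.
printf "encoded bytes: %s\n",
    join ' ', map { sprintf '%02x', ord } split //, $buf;
```

This prints "length=1 code point=U+200C" and "encoded bytes: e2 80 8c", with no "Wide character" warning. In this sketch the escape parses to a single character, and the warning only appears when that character is printed to a handle without an encoding layer.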
I can "fix" it like this:
my $c = $res->content;
$c =~ s/\\u[0-9a-f]{4}//;
my $json = from_json($c);
and then the output text is correct (right-to-left):
افزودن یک کارت پیشفرض 111
Question: What is going on here?
- Is this a bug in Perl or in JSON?
- Should \u200c be parsed properly in some other way?