Wide character in print for some Farsi text, but not others

Question

I'm using Google Translate to convert some error codes into Farsi with Perl. Farsi is just one example; I've found this issue in other languages too, but for this discussion I'll stick to it.

The translated text of "Geometry data card error" works fine (Example 1) but translating "Appending a default 111 card" (Example 2) gives the "Wide character" error.

Both examples can be run from the terminal; they are just print statements.

I've tried the usual things like these, but to no avail:

use utf8;
use open ':std', ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

Example 1: This works

perl -Mutf8 -le 'print "\x{d8}\x{ae}\x{d8}\x{b7}\x{d8}\x{a7}\x{db}\x{8c} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d8}\x{af}\x{d8}\x{a7}\x{d8}\x{af}\x{d9}\x{87} \x{d9}\x{87}\x{d9}\x{86}\x{d8}\x{af}\x{d8}\x{b3}\x{db}\x{8c}"'
خطای کارت داده هندسی

Example 2: This produces Wide char warnings and prints noise

perl -Mutf8 -le 'print "\x{d8}\x{a7}\x{d9}\x{81}\x{d8}\x{b2}\x{d9}\x{88}\x{d8}\x{af}\x{d9}\x{86} \x{db}\x{8c}\x{da}\x{a9} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d9}\x{be}\x{db}\x{8c}\x{d8}\x{b4}\x{200c}\x{d9}\x{81}\x{d8}\x{b1}\x{d8}\x{b6} 111"'
Wide character in print at -e line 1.
# <terminal noise, not Farsi text>

Using Curl

If I do the same request with curl I get this:

curl 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=auto&tl=fa&hl=fa&dt=t&ie=UTF-8&oe=UTF-8&otf=1&ssel=0&tsel=0&tk=xxxx&dt=dj&q=%41%70%70%65%6E%64%69%6E%67%20%61%20%64%65%66%61%75%6C%74%20%31%31%31%20%63%61%72%64'
[[["افزودن یک کارت پیش\u200cفرض 111","Appending a default 111 card",null,null,3,null,null,[[]],[[["982c75c78c6c8e6005ec3a4021a7f785","tea_GrecoIndoEuropeA_en2elfahykakumksq_2021q3.md"]]]]],null,"en",null,null,null,1,[],[["en"],null,[1],["en"]]]

Notice the \u200c in the JSON output above, which is a "Zero Width Non-Joiner" Unicode character. When JSON::from_json parses the \u200c, it blows up:

perl -Mutf8 -MJSON -e 'print from_json("[\"\\u200c\"]")->[0];'
Wide character in print at -e line 1.

I can "fix" it like this:

my $c = $res->content;
$c =~ s/\\u[0-9a-f]{4}//;
my $json = from_json($c);

and then the output text is correct (right-to-left):

افزودن یک کارت پیشفرض 111

Question: What is going on here?

Is this a bug in Perl or in JSON?
Should \u200c be parsed properly in some other way?

Your first example looks like a bunch of escaped UTF-8 bytes, not actual UTF-8 encoded text. Your second example mixes that with an escaped Unicode character. You should stick with one style or the other (replace `\x{200c}` with `\xE2\x80\x8C`) — Shawn, Apr 09 '22 at 01:16
The `-CO` option will tell perl to encode Unicode text written to stdout as utf-8 and suppress the warning in your one-liner. See perlrun for more. — Shawn, Apr 09 '22 at 01:23
I don't think any of your code snippets need the utf8 module; that just tells perl the script is encoded in utf-8 and all yours look like plain ASCII. — Shawn, Apr 09 '22 at 01:25
Reading https://perldoc.perl.org/perluniintro is a good idea; it covers some of this in more detail. In particular see https://perldoc.perl.org/perluniintro#Perl's-Unicode-Model for your second example. — Shawn, Apr 09 '22 at 01:31
@Shawn, replacing `\x{200c}` with `\xE2\x80\x8C` does fix it! Is there a perl way to replace it programmatically? I'll read the perl docs you referenced and see what I come up with, but if you know a quick fix... — KJ7LNW, Apr 09 '22 at 01:47

Shawn · Accepted Answer · 2022-04-09T02:33:34.060

There's a lot of stuff going on here. I think a lot of it, especially in the first two examples, stems from not understanding the difference between perl's two string modes (byte oriented and Unicode codepoint oriented).

Example 1 is a raw byte string whose bytes happen to be UTF-8 encoded text. They are passed through unchanged, so as long as the terminal displaying the output expects UTF-8, they render correctly. Example 2 contains a 'wide' character (one with a value greater than 255), which makes it a Unicode string: every \x{NN} is now treated as a Unicode codepoint, and each one greater than 127 is encoded as multiple bytes in UTF-8. Printing it causes mojibake and a warning because standard output is byte oriented and has no encoding layer.
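The difference between the two examples can be reproduced in miniature. The following is only a sketch: it uses the core Encode module, shortens both examples to their first letter, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Example 1 in miniature: two octets that are the UTF-8 encoding of U+062E.
# Every value is <= 0xFF, so print would emit them unchanged.
my $bytes = "\x{d8}\x{ae}";

# Decoding turns those octets into one character.
my $chars = decode('UTF-8', $bytes);

# Example 2 in miniature: the same octets plus one real codepoint above 0xFF.
# Perl now sees three characters: U+00D8, U+00AE and U+200C.
my $mixed = "\x{d8}\x{ae}\x{200c}";

# These are the octets print falls back to after the "Wide character"
# warning: each of the first two characters is encoded a second time.
my $garbled = encode('UTF-8', $mixed);

# Consistent alternative: codepoints only, encoded once on the way out.
my $clean = encode('UTF-8', "\x{62e}\x{200c}");

# Hypothetical helper: render a byte string as hex for inspection.
sub hex_of { join ' ', map { sprintf '%02x', ord } split //, $_[0] }

print 'bytes:   ', hex_of($bytes),   "\n";   # d8 ae
print 'garbled: ', hex_of($garbled), "\n";   # c3 98 c2 ae e2 80 8c
print 'clean:   ', hex_of($clean),   "\n";   # d8 ae e2 80 8c
```

The "clean" line is what the terminal needs to receive. It can be produced either by writing everything as escaped bytes, or by writing everything as codepoints and encoding on output.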

As I suggested in a comment, reading perluniintro (and the other Unicode-related documentation) is a good start for learning how things work.

But on to the actual task, extracting text from the JSON returned by your curl commands... I'd use jq instead if this is for a shell script:

$ curl ... | jq -r '.[0][0][0]'
افزودن یک کارت پیش‌فرض 111

Compare to the equivalent perl one-liner:

$ curl ... | perl -CS -MJSON -lne 'print from_json($_)->[0][0][0]'
افزودن یک کارت پیش‌فرض 111

The -CS argument tells perl that standard input, output, and error are all UTF-8 encoded. You could also use -CO to make that just standard output, and use decode_json() instead, which expects raw UTF-8 encoded bytes instead of a Unicode string.
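To make the bytes-versus-characters distinction concrete, here is a rough sketch of the two input modes. It assumes JSON::PP (the pure-Perl backend that ships with Perl) in place of the JSON module. As far as I know, a decoder with utf8 enabled behaves like decode_json(), and one without it behaves like from_json(). The sample payload and variable names are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;

# Raw octets, as they would arrive from curl or $res->content: a JSON array
# holding U+0641 as literal UTF-8 followed by an escaped ZWNJ.
my $octets = qq(["\xd9\x81\\u200c"]);

# utf8 enabled: the input is treated as UTF-8 octets (like decode_json).
my $from_bytes = JSON::PP->new->utf8->decode($octets)->[0];

# utf8 disabled: the input must already be characters (like from_json),
# so decode the octets first.
my $from_chars = JSON::PP->new->decode(decode('UTF-8', $octets))->[0];

# The mismatch: raw octets handed to the character-mode decoder. Each octet
# is taken as its own character, while \u200c becomes a real wide character.
my $broken = JSON::PP->new->decode($octets)->[0];

# Hypothetical helper: list the codepoints in a string.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, $_[0] }

print codepoints($from_bytes), "\n";   # U+0641 U+200C
print codepoints($from_chars), "\n";   # U+0641 U+200C
print codepoints($broken),     "\n";   # U+00D9 U+0081 U+200C
```

The third result has the same shape as Example 2 in the question: UTF-8 bytes mixed with a genuine codepoint in a single string.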

In a script rather than a one-liner, the way to go is the OO interface to JSON, using its methods to say how input strings are encoded, plus the open pragma (or binmode, or an encoding layer passed to open) instead of the -C option.
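A script along those lines might look roughly like this. It is a sketch under assumptions, not a drop-in solution: it uses JSON::PP so that it runs on a stock Perl (the JSON module offers the same new, utf8 and decode methods), extract_translation is a hypothetical helper, and $body stands in for $res->content using a trimmed copy of the curl output from the question.

```perl
use strict;
use warnings;
use utf8;                      # this source file contains literal Farsi text
use Encode qw(encode);
use JSON::PP;

# Encode characters to UTF-8 exactly once, at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: take the raw response octets, return the translation.
sub extract_translation {
    my ($octets) = @_;
    # utf8 enabled: the decoder expects UTF-8 octets and returns characters.
    my $data = JSON::PP->new->utf8->decode($octets);
    return $data->[0][0][0];
}

# Stand-in for $res->content (assumption: the real body has more fields).
my $body = encode('UTF-8',
    '[[["افزودن یک کارت پیش\u200cفرض 111","Appending a default 111 card"]]]');

my $text = extract_translation($body);
print "$text\n";               # no "Wide character" warning, ZWNJ preserved
```

The design choice is to keep bytes at the edges (the network response and the terminal) and characters everywhere in between, so no string ever mixes the two.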

Look closely and you can see the missing ZWNJ between ش‌ف vs شف on the `jq` vs `from_json` versions about 3 glyphs from the left of the Farsi script. -CS might suppress wide char warnings, but from_json is losing the ZWNJ (\u200c). However, as you suggest, running from_json from a JSON object with ->utf8() enabled will fix it. — KJ7LNW, Apr 09 '22 at 02:24
@KJ7LNW I see it in my actual output when I run it through a hex dump. (Might have been lost in this answer due to copy & paste and/or SO stuff.) — Shawn, Apr 09 '22 at 02:28

score 1 · Answer 2 · answered Apr 09 '22 at 02:05

The JSON object needs to have utf8 enabled and it will fix the \u200c. Thanks to @Shawn for pointing me in the right direction:

my $j = JSON->new;
$j->utf8(1);
my $json = $j->decode($c);

Now JSON escapes like \u200c are decoded to the correct character (U+200C), which comes out as \xe2\x80\x8c once the text is printed as UTF-8.
