
I'm trying to run a simple test in which I build differently encoded binary strings and print them out. Ultimately, I'm investigating a problem where sprintf cannot deal with a wide-character string passed in for the placeholder %s.

In this case, the binary string just contains the Cyrillic "д" (because it lies outside ISO-8859-1).

The code below works when I use the character directly in the source.

But nothing that passes through pack works.

  • For the UTF-8 case, I need to set the UTF-8 flag on the string $ch, but how?
  • The UCS-2 case fails, and I suppose it's because there is no way for Perl to tell UCS-2 from ISO-8859-1, so that test is probably bollocks, right?

The code:

#!/usr/bin/perl

use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"

# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

sub showme {
   my ($name,$ch) = @_;
   print "-------\n";
   print "This is test: $name\n";

   my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint

   {
      # https://perldoc.perl.org/bytes.html
      use bytes;
      my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
      my $txt  = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
      print $txt,"\n";
   }

   print $ch, "\n";
   print "Combine: $ch\n";
   print "Concat: " . $ch . "\n";
   print "Sprintf: " . sprintf("%s",$ch) . "\n";
   print "-------\n";
}


showme("Cryillic direct" , "д");
showme("Cyrillic UTF-8"  , pack("HH","D0","B4"));  # UTF-8 of д is D0B4
showme("Cyrillic UCS-2"  , pack("HH","04","34"));  # UCS-2 of д is 0434

Current output:

Looks good

-------
This is test: Cryillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes

д
Combine: д
Concat: д
Sprintf: д
-------

That's a no. Where does the 176 come from??

-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no

а
Combine: а
Concat: а
Sprintf: а
-------

This is even worse.

-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no

0
Combine: 0
Concat: 0
Sprintf: 0
-------
David Tonhofer
  • What are you trying to accomplish? (I see `use bytes;` and `is_utf8`, neither of which you should be using.) – ikegami Mar 30 '20 at 17:46
  • @ikegami "use bytes" to dump internal structure info, "is_utf8" to check whether the flag is set. What else? – David Tonhofer Mar 30 '20 at 19:19
  • Like you said, both relate to how the string is stored internally. This isn't something you should care about. – ikegami Mar 30 '20 at 19:22
  • @ikegami True, but I'm actually trying to get at a problem with the interface between Perl and curses. Maybe a bug, maybe not. It's dentist's work. – David Tonhofer Mar 30 '20 at 21:11
  • Feel free to ask a question about that :) – ikegami Mar 30 '20 at 21:13
  • ...Except I see that you already did. I hadn't seen that question. I'll have a look. – ikegami Mar 30 '20 at 21:16
  • ...Except there's an accepted answer that seems perfectly reasonable. `wget_wch` returns a string of UCP aka decoded text, which is great. Is there an outstanding issue? – ikegami Mar 30 '20 at 21:18
  • ...Except that your answer suggests that you have problems printing something using printw? Do you have a minimal reproducible example? There is something called "The Unicode Bug" where the UTF8 flag is (effectively) given meaning it shouldn't have (accidentally or on purpose). It's common for XS modules to suffer from this bug, and a libcurses-interfacing module would be such an XS module. Maybe `printw` suffers from it. Provide a minimal, runnable demonstration, and I'll have a look. – ikegami Mar 30 '20 at 21:34
  • @ikegami I'm not THAT fast :-) ... moving towards it, this already has burnt two days. – David Tonhofer Mar 31 '20 at 12:32

3 Answers


You have two problems.


Your calls to pack are incorrect

Each H represents one hex digit.

$ perl -e'printf "%vX\n", pack("HH", "D0", "B4")'       # XXX
D0.B0

$ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")'     # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")'    # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")'    # Better
D0.B4

$ perl -e'printf "%vX\n", pack("H*", "D0B4")'           # Alternative
D0.B4
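
This also accounts for the 176 in the question's output: a bare H has an implied count of 1, so only the first hex digit of "B4" is consumed and the missing low nybble is padded with zero, giving 0xB0 = 176. A minimal sketch that checks this (the variable names are only illustrative):

```perl
use strict;
use warnings;

# A bare "H" has an implied count of 1: only the first hex digit of each
# argument is used, and the missing low nybble is zero-padded.
my $wrong = pack("HH",   "D0", "B4");   # bytes D0 B0
my $right = pack("H2H2", "D0", "B4");   # bytes D0 B4

my $wrong_dump = sprintf("%vd", $wrong);
my $right_dump = sprintf("%vd", $right);

print "HH:   $wrong_dump\n";   # 208.176
print "H2H2: $right_dump\n";   # 208.180
```

The same rule explains the "0.48" in the UCS-2 test: "04" contributes only "0" (0x00) and "34" only "3" (0x30 = 48).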

STDOUT is expecting decoded text, but you are providing encoded text

First, let's take a look at the strings you are producing (once the problem mentioned above is fixed). All you need for that is the %vX format, which prints the value of each character in hex, separated by periods.

  • "д" produces a one-character string. This character is the Unicode Code Point for д.

    $ perl -e'use utf8; printf("%vX\n", "д");'
    434
    
  • pack("H*", "D0B4") produces a two-character string. These characters are the UTF-8 encoding of д.

    $ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
    D0.B4
    
  • pack("H*", "0434") produces a two-character string. These characters are the UCS-2be and UTF-16be encodings of д.

    $ perl -e'printf("%vX\n", pack("H*", "0434"));'
    4.34
    

Normally, a file handle expects a string of bytes (characters with values in 0..255) to be printed to it. These bytes are output verbatim.[1][2]

When an encoding layer (e.g. :encoding(UTF-8)) is added to a file handle, it expects a string of Unicode Code Points (aka decoded text) to be printed to it instead.

Your program adds an encoding layer to STDOUT (through its use of the use open pragma), so you must provide UCP (decoded text) to print and say. You can obtain decoded text from encoded text using, for example, Encode's decode function.

use utf8;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

use Encode qw( decode );

say "д";                   # ok  (UCP of "д")
say pack("H*", "D0B4");    # XXX (UTF-8 encoding of "д")
say pack("H*", "0434");    # XXX (UCS-2be and UTF-16be encoding of "д")

say decode("UTF-8",    pack("H*", "D0B4"));   # ok (UCP of "д")
say decode("UCS-2be",  pack("H*", "0434"));   # ok (UCP of "д")
say decode("UTF-16be", pack("H*", "0434"));   # ok (UCP of "д")
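
To see what actually reaches the handle in the XXX cases, one can print into an in-memory buffer that has the same encoding layer and dump the resulting bytes. This is only a sketch; bytes_written is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: print a string through an :encoding(UTF-8) layer
# into an in-memory buffer and return the bytes written, as hex.
sub bytes_written {
    my ($str) = @_;
    my $buf = '';
    open(my $fh, '>:encoding(UTF-8)', \$buf) or die "open: $!";
    print {$fh} $str;
    close($fh) or die "close: $!";
    return uc(unpack('H*', $buf));
}

my $encoded = pack("H*", "D0B4");   # UTF-8 bytes of "д"

# Encoded text gets encoded a second time: U+00D0 and U+00B4 are written.
my $out_encoded = bytes_written($encoded);
# Decoded text (one character, U+0434) is encoded exactly once.
my $out_decoded = bytes_written(decode("UTF-8", $encoded));

print "encoded text: $out_encoded\n";   # C390C2B4
print "decoded text: $out_decoded\n";   # D0B4
```

The first line shows the double encoding that produces mojibake on the terminal; the second shows the expected UTF-8 bytes of д.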

For the UTF-8 case, I need to set the UTF-8 flag on

No, you need to decode the strings.

The UTF-8 flag is irrelevant. Whether the flag is set or not originally is irrelevant. Whether the flag is set or not after the string is decoded is irrelevant. The flag indicates how the string is stored internally, something you shouldn't care about.

For example, take

use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

my $x = chr(0xE9);

utf8::downgrade($x);   # Tell Perl to use the UTF8=0 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

utf8::upgrade($x);   # Tell Perl to use the UTF8=1 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

It outputs

UTF8=0 E9 é
UTF8=1 E9 é

Regardless of the UTF8 flag, the UTF-8 encoding (C3 A9) of the provided UCP (U+00E9) is output.
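
The same point can be checked without any output at all. In this small sketch (variable names are illustrative), the two strings differ only in their internal storage format, yet they compare equal and encode to identical bytes:

```perl
use strict;
use warnings;
use Encode qw( encode );

my $down = chr(0xE9);
my $up   = chr(0xE9);
utf8::downgrade($down);   # UTF8=0 storage format
utf8::upgrade($up);       # UTF8=1 storage format

# The flags differ, but the strings are the same string.
my $flags_differ = (utf8::is_utf8($down) xor utf8::is_utf8($up)) ? "yes" : "no";
my $same_string  = ($down eq $up) ? "yes" : "no";
my $same_bytes   = (encode("UTF-8", $down) eq encode("UTF-8", $up)) ? "yes" : "no";

print "flags differ: $flags_differ\n";   # yes
print "same string:  $same_string\n";    # yes
print "same bytes:   $same_bytes\n";     # yes
```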


I suppose it's because there is no way for Perl to tell UCS-2 from ISO-8859-1, so that test is probably bollocks, right?

At best, one could employ heuristics to guess whether a string is encoded using iso-latin-1 or UCS-2be. I suspect one could get rather accurate results (like those you'd get for iso-latin-1 and UTF-8.)

I'm not sure why you bring up iso-latin-1 since nothing else in your question relates to iso-latin-1.
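
For the iso-latin-1 versus UTF-8 case mentioned above, one possible heuristic is to attempt a strict UTF-8 decode and fall back to iso-latin-1 when it fails. The following is a sketch under that assumption; decode_guess is a hypothetical helper, and a heuristic for UCS-2be would need different checks.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: treat the bytes as UTF-8 if they decode strictly,
# otherwise assume ISO-8859-1 (in which every byte sequence is valid).
sub decode_guess {
    my ($bytes) = @_;
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ("UTF-8", $text) if defined $text;
    return ("ISO-8859-1", decode("ISO-8859-1", $bytes));
}

my ($enc1) = decode_guess(pack("H*", "D0B4"));   # well-formed UTF-8
my ($enc2) = decode_guess(pack("H*", "E9"));     # a lone E9 is not valid UTF-8

print "D0 B4 -> $enc1\n";   # UTF-8
print "E9    -> $enc2\n";   # ISO-8859-1
```

This is a guess, not a detection: pure ASCII input and latin-1 text that happens to form valid UTF-8 sequences will always be reported as UTF-8.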


  1. Except on Windows, where a :crlf layer is added to handles by default.

  2. You get a Wide character warning if you provide a string that contains a character that's not a byte, and the utf8 encoding of the string is output instead.
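
The behaviour described in the second footnote can be observed with a handle that has no encoding layer. This sketch assumes an in-memory file handle behaves like an ordinary byte handle in this respect, and it captures the warning through a __WARN__ handler instead of letting it reach STDERR:

```perl
use strict;
use warnings;

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# In-memory handle with no encoding layer: it expects bytes.
my $buf = '';
open(my $fh, '>', \$buf) or die "open: $!";
print {$fh} "\x{434}";   # a character that is not a byte
close($fh) or die "close: $!";

my $written = unpack('H*', $buf);
print "warning: $warnings[0]";
print "written: $written\n";
```

The captured warning should mention "Wide character", and the buffer should hold the two bytes of the utf8 encoding of U+0434.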

ikegami

Please see if the following demonstration code is of any help.

use strict;
use warnings;
use feature 'say';

use utf8;     # https://perldoc.perl.org/utf8.html
use Encode;   # https://perldoc.perl.org/Encode.html

my $str;

my $utf8   = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

# https://perldoc.perl.org/functions/binmode.html

binmode STDOUT, ':utf8'; 

# https://perldoc.perl.org/feature.html#The-'say'-feature

say 'UTF-8:   ' . $utf8;  

# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API

$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);  

$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);

$str = pack('H*',$utf16);
say 'UTF-16:  '. decode('UTF16',$str);

$str = pack('H*',$utf32);
say 'UTF-32:  ' . decode('UTF32',$str);

Output

UTF-8:   Привет Москва
UCS-2BE: Привет Москва
UCS-2LE: Привет Москва
UTF-16:  Привет Москва
UTF-32:  Привет Москва

Supported Cyrillic encodings

use strict;
use warnings;
use feature 'say';

use Encode;
use utf8;

binmode STDOUT, ':utf8';

my $utf8 = 'Привет Москва';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;

say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       ', $utf8;

for (@encodings) {
    printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}

Output

:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       Привет Москва
UCS-2       041f044004380432043504420020041c043e0441043a04320430
UCS-2LE     1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE     041f044004380432043504420020041c043e0441043a04320430
UTF-16      feff041f044004380432043504420020041c043e0441043a04320430
UTF-32      0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5  bfe0d8d2d5e220bcdee1dad2d0
CP855       dde1b7eba8e520d3d6e3c6eba0
CP1251      cff0e8e2e5f220cceef1eae2e0
KOI8-F      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U      f0d2c9d7c5d420edcfd3cbd7c1

Documentation Encode::Supported
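
As a sanity check on a table like the one above, one can encode the text in each encoding and decode it again. A minimal sketch, using a subset of the encodings listed above (the %round_trips hash is only for illustration):

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode decode );

my $text = 'Привет Москва';
my @encodings = qw( UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-R KOI8-U );

# For each encoding, check that decode(encode(text)) gives the text back.
my %round_trips;
for my $enc (@encodings) {
    my $bytes = encode($enc, $text);
    $round_trips{$enc} = (decode($enc, $bytes) eq $text) ? 1 : 0;
}

printf "%-11s %s\n", $_, $round_trips{$_} ? "ok" : "MISMATCH" for @encodings;
```

Because the output is plain ASCII, no encoding layer on STDOUT is needed here.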

Polar Bear

Both are good answers. Here is a slight extension of Polar Bear's code to print details about the string:

use strict;
use warnings;
use feature 'say';

use utf8;
use Encode;

sub about {
   my($str) = @_;
   # https://perldoc.perl.org/bytes.html
   my $charlen = length($str);
   my $txt;
   {
      use bytes;
      my $mark = (utf8::is_utf8($str) ? "yes" : "no");
      my $bytelen = length($str);
      $txt  = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n", 
                      $bytelen,$charlen,$mark,$str);
   }
   return $txt;
}

my $str;

my $utf8   = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

binmode STDOUT, ':utf8';

say 'UTF-8:   ' . $utf8;
say about($utf8);

{
   my $str = pack('H*',$ucs2be);
   say 'UCS-2BE: ' . decode('UCS-2BE',$str);
   say about($str);
}

{
   my $str = pack('H*',$ucs2le);
   say 'UCS-2LE: ' . decode('UCS-2LE',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf16);
   say 'UTF-16:  '. decode('UTF16',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf32);
   say  'UTF-32:  ' . decode('UTF32',$str);
   say about($str);
}

# Try identity transcoding

{
   my $str_encoded_in_utf16 = encode('UTF16',$utf8);
   my $str = decode('UTF16',$str_encoded_in_utf16);
   say 'The same: ' . $str;
   say about($str);
}

Running this gives:

UTF-8:   Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176

UCS-2BE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UCS-2LE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4

UTF-16:  Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UTF-32:  Привет Москва
Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48

The same: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
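
Since the comments on the question advise against `use bytes`, here is an alternative sketch for the length figures: encode explicitly and measure the result. This reports the size in a chosen encoding, not Perl's internal storage, which is an assumption about what one actually wants to know.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode );

my $text = 'Привет Москва';

# Character count, independent of any encoding.
my $chars = length($text);

# Byte counts for specific encodings, without "use bytes".
my $utf8_bytes  = length(encode('UTF-8',    $text));
my $utf16_bytes = length(encode('UTF-16BE', $text));

print "chars: $chars, UTF-8 bytes: $utf8_bytes, UTF-16BE bytes: $utf16_bytes\n";
# chars: 13, UTF-8 bytes: 25, UTF-16BE bytes: 26
```

These numbers match the 13 chars, 25 bytes and 26 bytes reported in the output above.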

And here is a little diagram I made as an overview, covering encode, decode and pack, because one had better be ready for next time.

perl_strings_and_encode_decode

(The above diagram & its graphml file available here)

David Tonhofer