Perl's IO::File and use open qw(:utf8)

Question

IO::File->open() doesn't seem to respect use open() in the following program, which is odd to me and seems to be against the documentation. Or maybe I'm doing it wrong. Rewriting my code to not use IO::File shouldn't be difficult.

I expect the output to be

$VAR1 = \"Hello \x{213} (r-caret)";

Hello ȓ (r-caret)
Hello ȓ (r-caret)
Hello ȓ (r-caret)

But I'm getting this error: "Oops: Malformed UTF-8 character (unexpected end of string) in print at ./run.pl line 33."

That doesn't seem right to me at all.

#!/usr/local/bin/perl

use utf8;
use v5.16;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use diagnostics;
use open qw(:std :utf8);
use charnames qw(:full :short);

use File::Basename;
my $application = basename $0;

use Data::Dumper;
$Data::Dumper::Indent = 1;

use Try::Tiny;

my $str = "Hello ȓ (r-caret)";

say Dumper(\$str);

open(my $fh, '<', \$str);
print while ($_ = $fh->getc());
close($fh);
print "\n";

try {
  use IO::File;
  my $fh = IO::File->new();
  $fh->open(\$str, '<');
  print while ($_ = $fh->getc());
  $fh->close();
  print "\n";
}
catch {
  say "\nOops: $_";
};

try {
  use IO::File;
  my $fh = IO::File->new();
  $fh->open(\$str, '<:encoding(UTF-8)');
  print while ($_ = $fh->getc());
  $fh->close();
  print "\n";
}
catch {
  say "\nOops: $_";
};

score 7 · Accepted Answer · answered Jan 30 '13 at 02:28

I believe what's happening here is that use open is a lexical pragma, meaning it only affects calls to open() that are written in the same lexical scope, that is, inside the same block (or file) as the pragma. IO::File->open is a wrapper around open(), so the actual open() call is made inside IO::File's own code, outside your lexical scope.

{
    use open;

    ...same lexical scope...

    {
        ...inner lexical scope...
        ...inherits from the outer...
    }

    ...still the same lexical scope...
    foo();
}

sub foo {
    ...outside "use open"'s lexical scope...
}

In the example above, even though foo() is called inside use open's lexical scope, the code inside foo() is outside and thus not under its effect.
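Here is a minimal runnable sketch of that scoping rule. The scratch file and the helper name open_outside() are made up for the illustration, and it assumes Perl is not started with -C or PERL_UNICODE, which would change the default layers.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file containing the two UTF-8 bytes that encode U+0213.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "\xC8\x93";
close($tmp);

# Hypothetical helper, compiled outside any "use open" scope:
# its open() call gets no default layers.
sub open_outside {
    my ($p) = @_;
    open(my $fh, '<', $p) or die "open: $!";
    return $fh;
}

my ($inside, $outside);
{
    use open qw(:utf8);

    # This open() is written inside the pragma's scope, so :utf8 applies.
    open(my $fh, '<', $path) or die "open: $!";
    $inside = <$fh>;
    close($fh);

    # The call is made from inside the scope, but the open() itself is not.
    my $ofh = open_outside($path);
    $outside = <$ofh>;
    close($ofh);
}

printf "inside:  %d character(s)\n", length($inside);
printf "outside: %d character(s)\n", length($outside);
```

On a default setup this should report 1 character for the handle opened inside the block (the bytes were decoded) and 2 for the one opened by the helper (raw bytes), which is the same thing that happens when the open() lives in IO::File.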

It would be polite if IO::File honored the caller's open.pm settings. This is not trivial, but it is possible. A similar problem plagued autodie; it was fixed, and the same fix could probably work in IO::File.
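In the meantime, one way around it is to give the IO::File handle its layer explicitly rather than relying on use open. This is only a sketch: the scratch file is invented for the example, and it relies on the documented IO::File behavior that a mode containing a colon is passed through to three-argument open().

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempfile);

# Scratch file holding the UTF-8 encoding of "Hello \x{213}".
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "Hello \xC8\x93";
close($tmp);

# Option 1: name the layer in the mode string.
my $fh1 = IO::File->new($path, '<:encoding(UTF-8)') or die "open: $!";
my $line1 = <$fh1>;
$fh1->close;

# Option 2: open normally, then add the layer with binmode().
my $fh2 = IO::File->new($path, 'r') or die "open: $!";
$fh2->binmode(':encoding(UTF-8)');
my $line2 = <$fh2>;
$fh2->close;

printf "%d %d\n", length($line1), length($line2);
```

Both handles should return the same 7-character decoded string. Naming the layer at each open is more typing than a pragma, but it does not depend on which package the open() call happens to be compiled in.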

Well that makes sense then. I'll rewrite the code to call open() directly. Thanks. — Michael, Jan 30 '13 at 02:33

ikegami · Answer 2 · 2013-01-30T03:26:56.690

[This is not an answer, but a notification of a bug that doesn't fit in a comment.]

Files can only contain bytes. $str contains values that aren't bytes. Therefore,

open(my $fh, '<', \$str)

makes no sense. It should be

open(my $fh, '<', \encode_utf8($str))

use utf8;
use v5.16;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw( :std :utf8 );
use Encode qw( encode_utf8 );
use Data::Dumper qw( Dumper );

sub dump_str {
   local $Data::Dumper::Useqq = 1;
   local $Data::Dumper::Terse = 1;
   local $Data::Dumper::Indent = 0;
   return Dumper($_[0]);
}

for my $encode (0..1) {
   for my $orig ("\x{213}", "\x{C9}", substr("\x{C9}\x{213}", 0, 1)) {
      my $file_ref = $encode ? \encode_utf8($orig) : \$orig;
      my $got = eval { open(my $fh, '<', $file_ref); <$fh> };
      printf("%-10s  %-6s  %-9s => %-10s => %s\n",
         $encode ? "bytes" : "codepoints",
         defined($got) && $orig eq $got ? "ok" : "not ok",
         dump_str($orig),
         dump_str($$file_ref),
         defined($got) ? dump_str($got) : 'DIED',
      );
   }
}

Output:

codepoints  ok      "\x{213}" => "\x{213}"  => "\x{213}"
codepoints  not ok  "\311"    => "\311"     => DIED
codepoints  not ok  "\x{c9}"  => "\x{c9}"   => DIED
bytes       ok      "\x{213}" => "\310\223" => "\x{213}"
bytes       ok      "\311"    => "\303\211" => "\x{c9}"
bytes       ok      "\x{c9}"  => "\303\211" => "\x{c9}"
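Applied to the IO::File code in the question, the same idea might look like the sketch below: encode the string first, then let an explicit layer decode it on the way back in. The variable names are illustrative, and the string uses an escape rather than a literal so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use IO::File;
use Encode qw(encode_utf8);

my $str   = "Hello \x{213} (r-caret)";  # decoded text (code points)
my $bytes = encode_utf8($str);          # what a file would actually hold

# In-memory file over the encoded bytes, decoded by an explicit layer.
my $fh = IO::File->new(\$bytes, '<:encoding(UTF-8)') or die "open: $!";

my $read = '';
# defined() keeps the loop from stopping early on a "0" character.
while (defined(my $ch = $fh->getc)) {
    $read .= $ch;
}
$fh->close;

binmode(STDOUT, ':encoding(UTF-8)');
print "$read\n";
print $read eq $str ? "round trip ok\n" : "mismatch\n";
```

Because the layer is named in the mode string, this does not depend on use open reaching into IO::File, and because the in-memory file holds bytes, the decoding layer has something valid to decode.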

I don't think that's right. use v5.16; turns on unicode strings and use utf8; enables unicode in the source file. Reading from the string without the encode_utf8() should (and does) work. — Michael, Jan 30 '13 at 02:41
As you can see in the code I've added, you're wrong about it working. If you use `:utf8` (as opposed to another encoding), it will only work for some strings. (And when it does, it should give a "Wide character" warning or error. It's a bug in Perl that it doesn't.) — ikegami, Jan 30 '13 at 03:11
`$fh` produces decoded characters because its :utf8 layer decodes the stream. That means the stream must start encoded, but you're passing a decoded string. — ikegami, Jan 30 '13 at 03:17
Missing from above comment: And if you use an `:encoding` other than `:utf8`, it will work even less often. — ikegami, Jan 30 '13 at 06:34
