
The root cause of this question is my attempt to write tests for a new option/argument processing module (OptArgs) for Perl. This of course involves parsing @ARGV, which I am doing based on the answers to this question. This works fine on systems where I18N::Langinfo::CODESET is defined.[1]
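For reference, the decoding step on those systems is essentially the following (a minimal sketch of that approach; error handling omitted, and the setlocale call is needed so langinfo reflects the environment's locale):

```perl
#!/usr/bin/env perl
# Decode @ARGV from the locale's codeset into Perl's internal form.
# Minimal sketch; assumes I18N::Langinfo::CODESET is available.
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode);

setlocale( LC_CTYPE, '' );          # pick up the locale from the environment
my $codeset = langinfo(CODESET);    # e.g. "UTF-8" under a UTF-8 locale
@ARGV = map { decode( $codeset, $_ ) } @ARGV;
```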

On systems where langinfo(CODESET) is not available I would like to at least make a best effort based on observed behaviour. However, my tests so far indicate that on some systems I cannot even pass a Unicode argument to an external script properly.

I have managed to run something like the following on various systems, where "test_script" is a Perl script that merely does a `print Dumper(@ARGV)`:

use utf8;
my $utf8   = '¥';
my $result = qx/$^X test_script $utf8/;
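(For completeness, test_script amounts to nothing more than this; the shebang line is my own addition:)

```perl
#!/usr/bin/env perl
# test_script: dump the raw contents of @ARGV for inspection
use strict;
use warnings;
use Data::Dumper;

print Dumper(@ARGV);
```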

What I have found is that on FreeBSD test_script receives bytes which can be decoded into Perl's internal format. However, on OpenBSD and Solaris test_script appears to get the string "\x{fffd}\x{fffd}", which contains only the Unicode replacement character (twice?).

I don't know the mechanism underlying the qx operator. I presume it either execs directly or shells out, but unlike filehandles (on which I can set an encoding with binmode) I don't know how to ensure it does what I want. The same applies to system(). So my question is: what am I not doing correctly above? Otherwise, what is different about Perl, the shell, or the environment on OpenBSD and Solaris?

[1] Actually I think so far that is only Linux, according to CPAN Testers results.

Update (x2): I currently have the following running its way through CPAN Testers' setups to test Schwern's hypothesis:

use strict;
use warnings;
use Data::Dumper;

BEGIN {
    if (@ARGV) {
        require Test::More;
        Test::More::diag( "\npre utf8::all: "
              . Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
    }
}

use utf8;
use utf8::all;

BEGIN { 
    if (@ARGV) {
        Test::More::diag( "\npost utf8::all: "
              . Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
        exit;
    }
}

use Encode;
use Test::More;

my $builder = Test::More->builder;
binmode $builder->output,         ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output,    ':encoding(UTF-8)';

my $utf8  = '¥';
my $bytes = encode_utf8($utf8);

diag( "\nPassing: " . Dumper( { utf8 => $utf8, bytes => $bytes, } ) );

open( my $fh, '-|', $^X, $0, $utf8, $bytes ) || die "open: $!";
my $result = join( '', <$fh> );
close $fh;

ok(1);
done_testing();

I'll post the results from various systems when they come through. Any comments on the validity and/or correctness of this would be appreciated. Note that it is not intended to be a valid test; the purpose of the above is to compare what is received on different systems.

Resolution: The real underlying issue turns out to be something addressed neither in my question nor by Schwern's answer below. What I discovered is that some CPAN Testers machines only have an ASCII locale installed/available. I should not expect any attempt to pass UTF-8 characters to programs in that type of environment to work. So in the end my problem was invalid test conditions, not invalid code.

I have seen nothing so far to indicate that the qx operator or the utf8::all module has any effect on how parameters are passed to external programs. The critical component appears to be the LANG and/or LC_ALL environment variables, which inform the external program what locale it is running in.
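In other words, before expecting UTF-8 arguments to round-trip, a test should check that the environment actually advertises a UTF-8 locale. A minimal sketch (the variable precedence follows POSIX, where LC_ALL overrides LC_CTYPE which overrides LANG; the skip logic is my own, not part of OptArgs):

```perl
#!/usr/bin/env perl
# Detect whether the environment advertises a UTF-8 locale via
# LC_ALL / LC_CTYPE / LANG (in POSIX precedence order).
use strict;
use warnings;

my $locale = $ENV{LC_ALL} // $ENV{LC_CTYPE} // $ENV{LANG} // '';
my $has_utf8_locale = $locale =~ /utf-?8/i;

print $has_utf8_locale
    ? "UTF-8 locale detected ($locale)\n"
    : "no UTF-8 locale; skipping UTF-8 argument-passing tests\n";
```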

By the way, my original assertion that my code was working on all systems where I18N::Langinfo::CODESET is defined was incorrect.

  • On a related note, the BSDs seem to be broken in other ways. I can't even type unicode characters through a ssh session to FreeBSD - that results in odd terminal behaviour. – Mark Lawrence Jun 20 '12 at 01:40
  • The unicode-via-ssh probably depends heavily upon which terminal you're using and what your `TERM` is on both systems. – sarnold Jun 20 '12 at 01:45
  • I can't replicate your problem on OS X, but you might want to try [utf8::all](https://metacpan.org/module/utf8::all) to turn on most of the Unicode features including Unicode `@ARGV`. `qx` may also be affected by the `open` pragma, which `utf8::all` uses to make filehandles respect Unicode. – Schwern Jun 20 '12 at 05:18

1 Answer


qx makes a call to the shell, and that may be interfering.

To avoid that, use utf8::all to switch on all the Perl Unicode voodoo. Then use the open function to open a pipe to your program, avoiding the shell.

use utf8::all;
my $utf8 = '¥';

open my $read_from_script, "-|", "test_script", $utf8
    or die "open: $!";
print <$read_from_script>, "\n";
close $read_from_script;
  • Avoiding use of the shell with the 3-argument version of open is a good suggestion. However I can't see what effect utf8::all is supposed to have on arguments to the `open` function, nor on the underlying `exec` call. – Mark Lawrence Jun 20 '12 at 07:30
  • Looking at the source of utf8::all it actually makes assumptions about the encoding of `@ARGV` that [this](http://stackoverflow.com/questions/2037467/how-can-i-treat-command-line-arguments-as-utf-8-in-perl) warned against doing. However that is getting off topic from this question. – Mark Lawrence Jun 20 '12 at 07:34
  • @MarkLawrence `utf8::all` is having an effect via the `open` pragma. Specifically `use open ":std"` appears to affect pipe opens, probably by making STDOUT use UTF-8. It's a good example of "let somebody else figure it out and use their module". And yes, it is making an assumption about the encoding of `@ARGV`. You have to make an assumption; even if you don't, you're assuming ASCII, and UTF-8 is a pretty safe bet. Unfortunately it's not one which can be done lexically. – Schwern Jun 20 '12 at 20:27
  • I still fail to see the relevance of the encoding of the STDOUT filehandle with regard to the command *arguments*. The arguments are not being passed to the command via STDOUT or any other filehandle as far as I know. Somewhere, in some piece of code, my utf8 string becomes an argument to one or more `exec` system calls. I believe that is where my issue lies. Where I18N::Langinfo::CODESET is available, assumptions about `@ARGV` do not have to be made. – Mark Lawrence Jun 21 '12 at 01:44
  • @MarkLawrence Sorry it's not clear, did you try it and it didn't work? If it didn't work, make sure to use utf8::all on both your program and `test_script`. Perl consults the encoding layers set by open.pm for things other than just `open()`. – Schwern Jun 21 '12 at 03:56