18

How do I treat the elements of @ARGV as UTF-8 in Perl?

Currently I'm using the following work-around ..

    use Encode qw(decode encode);

    my $foo = $ARGV[0];
    $foo = decode("utf-8", $foo);

.. which works but is not very elegant.

I'm using Perl v5.8.8 which is being called from bash v3.2.25 with a LANG set to en_US.UTF-8.

brian d foy
knorv
  • Just a subtle nit: ARGV by itself normally denotes the filehandle named ARGV. The answer is a bit different for @ARGV, the array that holds the command-line parameters. :) – brian d foy Jan 10 '10 at 16:03

5 Answers

31

Outside data sources are tricky in Perl. Command-line arguments probably arrive in the encoding specified by your locale. Don't rely on your locale being the same as that of someone else who might run your program.

You have to find out what that encoding is, then decode the arguments into Perl's internal format. Fortunately, it's not that hard.

The I18N::Langinfo module has the stuff you need to get the encoding:

    use I18N::Langinfo qw(langinfo CODESET);
    my $codeset = langinfo(CODESET);

Once you know the encoding, you can decode them to Perl strings:

    use Encode qw(decode);
    @ARGV = map { decode $codeset, $_ } @ARGV;

Although Perl may store strings internally as UTF-8, you shouldn't ever need to think or know about that. You just decode whatever you get, which turns it into Perl's internal representation for you. Trust that Perl will handle everything else. When you need to store or output the data, encode it explicitly with the encoding you want.
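To make the decode-on-input, encode-on-output advice concrete, here is a minimal sketch. The helper name `decode_args` and the sample bytes are made up for illustration, and it assumes a platform where I18N::Langinfo is available.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: turn raw argument bytes into Perl character strings.
# FB_CROAK makes malformed input die instead of being silently replaced;
# LEAVE_SRC keeps decode() from modifying the input values.
sub decode_args {
    my ($codeset, @raw) = @_;
    return map {
        decode($codeset, $_, Encode::FB_CROAK() | Encode::LEAVE_SRC())
    } @raw;
}

# Real use: decode @ARGV with whatever encoding the locale reports.
my $codeset = langinfo(CODESET);
@ARGV = decode_args($codeset, @ARGV);

# Demonstration with fixed bytes: "café" encoded as UTF-8 is five bytes.
my $bytes   = "caf\xC3\xA9";
my ($chars) = decode_args('UTF-8', $bytes);
print length($bytes), " bytes, ", length($chars), " characters\n";

# When the data leaves the program, encode it explicitly.
my $out = encode('UTF-8', $chars);
```

The demonstration prints `5 bytes, 4 characters`, and `$out` holds the same five bytes that went in.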

If you know that your setup is UTF-8 and the terminal will give you the command-line arguments as UTF-8, you can use the A option with Perl's -C switch. This tells your program to assume the arguments are encoded as UTF-8:

    % perl -CA program
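If a script needs to know whether -CA (or an equivalent PERL_UNICODE setting) was in effect, one possible approach is to inspect `${^UNICODE}`, which reflects the -C settings; perlrun lists the value of the A option as 32. This is only a sketch: the helper name `argv_flag_set` is hypothetical, and the fallback assumes the arguments really are UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bit value that perlrun documents for the A option of -C.
use constant UNICODE_ARGV => 32;

# Hypothetical helper: true if the A bit is set in a -C style bitmask.
sub argv_flag_set {
    my ($bits) = @_;
    return ($bits & UNICODE_ARGV) ? 1 : 0;
}

# Decode by hand only when Perl has not already done it for us.
# Assumption: the arguments really are encoded as UTF-8.
unless (argv_flag_set(${^UNICODE})) {
    @ARGV = map { decode('UTF-8', $_) } @ARGV;
}

print "first argument has ", length($ARGV[0]), " characters\n" if @ARGV;
```

Written this way, the script behaves the same whether or not it was started with -CA, as long as the UTF-8 assumption holds.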
brian d foy
  • My problem with this answer is that I18N::Langinfo is not available on Win32 (even though it is in corelist!). – MichielB Sep 13 '12 at 15:26
  • My perl (5.18.0, Mac OS X 10.8) is returning US-ASCII in $codeset, even though my terminal is set to Unicode (UTF-8). The decode() works if I set $codeset to UTF-8 manually. – Michael Jun 17 '13 at 23:25
  • This returns `UTF-8` for me with v5.18 and X.8: `$ perl5.18.0 -MI18N::Langinfo=langinfo,CODESET -E 'say langinfo( CODESET )'`. Are you sure you have things set up correctly? – brian d foy Jun 18 '13 at 03:58
9

Use Encode::Locale:

    use Encode;
    use Encode::Locale qw(decode_argv);

    # decode_argv is not exported by default, so request it explicitly
    decode_argv(Encode::FB_CROAK);

This works well for me, including on Win32.

Smylers
MichielB
  • Which version of perl do you find `Encode::Locale` in? I've got v5.10.1, and trying `use Encode::Locale` results in the module not being found. :( – zrajm Jan 04 '14 at 11:34
  • It is not in core; you can install it from CPAN or with your package manager. – MichielB Jan 07 '14 at 12:52
  • In my case `decode_argv` is not imported by default, so `use Encode::Locale qw(decode_argv);` is required. – Javier Elices Nov 03 '21 at 14:28
4

The way you've done it seems correct. That's what I would do.

However, this perldoc page suggests that the command-line flag -CA tells Perl to treat @ARGV as UTF-8. (I haven't tested this.)

dma_k
FalseVinylShrub
  • -CA expects the command-line arguments to be encoded as UTF-8. That doesn't mean that they are. :) – brian d foy Jan 10 '10 at 15:57
  • Thanks for the info, so you're saying this way assumes UTF-8 encoding, but your way goes and finds out the encoding...? – FalseVinylShrub Jan 10 '10 at 16:07
  • I've found that it's never safe to assume any encoding. Too many people get it to work on their machine then find out it breaks for someone else who has a different setup. – brian d foy Jan 10 '10 at 17:13
  • Note that this doesn't work in a script, i.e. you can't do `#!/usr/bin/perl -CA`. Or at least it failed for a script I downloaded. – Timmmm Nov 02 '12 at 09:47
1

For example, on Windows, set the console code page:

    chcp 1251

In Perl:

    use utf8;
    use Modern::Perl;
    use Encode::Locale qw(decode_argv);

    if (-t) {
        binmode(STDIN, ":encoding(console_in)");
        binmode(STDOUT, ":encoding(console_out)");
        binmode(STDERR, ":encoding(console_out)");
    }

    Encode::Locale::decode_argv();

On the command line:

    perl -C ppixregexplain.pl qr/\bмама\b/i > ex1.html 2>&1

where ppixregexplain.pl

0

You shouldn't have to do anything special to the string. Perl strings are in UTF-8 by default starting with Perl 5.8.

    perl -CO -le 'print "\x{2603}"' | xargs perl -le 'print "I saw @ARGV"'

The code above works just fine on Ubuntu 9.04, OS X 10.6, and FreeBSD 7.

FalseVinylShrub brings up a good point. We can see a definite difference between

    perl -Mutf8 -wle ';print utf8::is_utf8($ARGV[0]) ? "t" : "f"' a

and

    perl -Mutf8 -CA -wle ';print utf8::is_utf8($ARGV[0]) ? "t" : "f"' a
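The difference between those two commands is the difference between raw bytes and decoded characters. A small sketch with fixed bytes (the UTF-8 encoding of U+2603, the snowman from the earlier one-liner) shows the same effect without depending on the shell or the locale; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE2\x98\x83";            # what an undecoded argument looks like
my $chars = decode('UTF-8', $bytes);   # the same data after decoding

printf "undecoded: length %d, is_utf8 %s\n",
    length($bytes), utf8::is_utf8($bytes) ? "t" : "f";
printf "decoded:   length %d, is_utf8 %s\n",
    length($chars), utf8::is_utf8($chars) ? "t" : "f";
```

The undecoded string reports length 3 and `f`; the decoded one reports length 1 and `t`. That is why a program that counts or matches characters needs the decoding step.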
Chas. Owens
  • The command-line arguments don't start life as Perl strings, though. It's an external data source like anything else. – brian d foy Jan 10 '10 at 15:56
  • But if his or her shell is set to UTF-8, then anything he or she types will be in UTF-8. – Chas. Owens Jan 10 '10 at 16:05
  • I find it easier to specify the working environment than to try to cover all possible environments. Now, if this is meant to be distributed to other people, that changes things, but the question included the fact that the terminal will be set to UTF-8. Similarly, most of the time I don't mess with `File::Spec`, even though my code won't work on certain systems. – Chas. Owens Jan 10 '10 at 21:25