How can I obtain correct non-ASCII command-line arguments in ActiveState Perl?

Question

Running the following command

perl -e "for (my $i = 0; $i < length($ARGV[0]); $i++) {print ord(substr($ARGV[0], $i, 1)), qq{\n}; }" αβγδεζ

on a Windows 7 cmd window with ActiveState Perl v5.14.2 produces the following result:

The above values are nonsensical and don't correspond to any known encoding, so trying to decode them with the approach recommended in "How can I treat command-line arguments as UTF-8 in Perl?" doesn't help. Changing the command window's active code page doesn't change the results.

This might not help you, but out of curiosity I tried on my Linux terminal set to use UTF-8, with Perl 5.12.4. After changing the quoting style to single quotes to avoid the shell interpreting the $ variables I got: 206 177 206 178 206 179 206 180 206 181 206 182. I checked the first letter, alpha, and it is correct, so I think it's the correct result. – stivlo, Oct 19 '11 at 16:30
Single quotes don't work on Windows, and I believe the correct results are `945 946 947 948 949 950` (http://tlt.its.psu.edu/suggestions/international/bylanguage/greekchart.html#greeklower) — MisterEd, Oct 19 '11 at 16:50
How did you type the characters on the command line? Did you copy and paste from some other program? — Sinan Ünür, Oct 19 '11 at 16:57
@MisterEd yes you're right. My output didn't lose any information but is a byte by byte output as opposed to a character by character output, which I can obtain with -CA switch as you suggested. Happy to have learnt something, thank you. — stivlo, Oct 19 '11 at 17:01

score 3 · Accepted Answer · edited Oct 20 '11 at 06:06

Your system, like every Windows system I know, uses the 1252 ANSI code page by default, so you could try to use

use Encode qw( decode );
@ARGV = map { decode('cp1252', $_) } @ARGV;

Note that cp1252 cannot represent all of those characters, which is why the console, and thus Perl, actually receives:

a 97
ß 223
? 63
d 100
e 101
? 63
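To see why no amount of decoding can repair this, here is a minimal sketch that decodes the six bytes from the listing above as cp1252. The byte string is hard-coded from that listing rather than read from @ARGV, and `code_points` is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode a byte string and return its code points.
sub code_points {
    my ($encoding, $bytes) = @_;
    return map { ord($_) } split //, decode($encoding, $bytes);
}

# The bytes Perl received for the argument (copied from the listing above).
my $received = "\x61\xDF\x3F\x64\x65\x3F";

# Prints: 97 223 63 100 101 63
print join(' ', code_points('cp1252', $received)), "\n";
```

The decoded string is "aß?de?", not the Greek letters: the substitution happened before Perl ever saw the argument, so the original characters are gone.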

There is a "Wide" interface for passing (almost) any Unicode code point to a program, but:

The Wide interface is not used when you type in a command at the prompt.
Perl uses the ANSI interface to fetch the parameters, so even if you started Perl using the Wide interface, the parameters would get downgraded to ANSI when Perl fetches them.

Sorry, but this is a "you can't" type of situation. You need a different approach. Diomidis Spinellis suggests changing your system's ANSI code page as follows in Win7:

Control Panel
Region and Language
Administrative
Language for non-Unicode programs
Set the Current language for non-Unicode programs to the language associated with the specific characters (Greek in your case).

At this point, you'd use the encoding of the ANSI code page associated with the newly selected language (cp1253 for Greek) instead of cp1252.

use Encode qw( decode );
@ARGV = map { decode('cp1253', $_) } @ARGV;
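Rather than hard-coding the code page, one could ask Windows for the current ANSI code page. The following is only a sketch under assumptions: it relies on `Win32::GetACP()` from the Win32 module, it assumes the returned number maps to an Encode name of the form `cpNNNN`, and `argv_encoding` is a hypothetical helper name. Outside Windows it simply falls back to UTF-8.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding );

# Hypothetical helper: guess the encoding of the command-line arguments.
sub argv_encoding {
    if ($^O eq 'MSWin32') {
        require Win32;
        # Assumption: ANSI code page 1253 corresponds to Encode's "cp1253", etc.
        my $name = 'cp' . Win32::GetACP();
        return $name if find_encoding($name);
    }
    # Fallback for other platforms, or for code pages Encode does not know.
    return 'UTF-8';
}

my $encoding = argv_encoding();
@ARGV = map { decode($encoding, $_) } @ARGV;
```

This does not lift the limitation described above: characters that the ANSI code page cannot represent are still lost before Perl receives them.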

Note that using chcp to modify the code page used within the console window does not affect the code page in which Perl receives its arguments, which is always an ANSI code page. See the examples below (cp737 is the Greek OEM code page, and cp1253 is the Greek ANSI code page. You can find the encodings labeled as 37 and M7 in this document.)

C:\>chcp 737
Active code page: 737

C:\>echo αβγδεζ | od -t x1
0000000 98 99 9a 9b 9c 9d 20 0d 0a

C:\>perl -e "print map sprintf('%x ', ord($_)), split(//, $ARGV[0])" αβγδεζ
e1 e2 e3 e4 e5 e6

C:\>chcp 1253
Active code page: 1253

C:\>echo αβγδεζ | od -t x1
0000000 e1 e2 e3 e4 e5 e6 20 0d 0a

C:\>perl -e "print map sprintf('%x ', ord($_)), split(//, $ARGV[0])" αβγδεζ
e1 e2 e3 e4 e5 e6
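The byte values in the transcript above can be checked with Encode. This sketch hard-codes the two byte sequences from the `od` and `perl` output and decodes each with its own code page; `code_points` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Byte sequences copied from the transcript above.
my $oem_bytes  = "\x98\x99\x9A\x9B\x9C\x9D";  # echo output under chcp 737
my $ansi_bytes = "\xE1\xE2\xE3\xE4\xE5\xE6";  # what Perl saw in $ARGV[0]

# Hypothetical helper: code points of a byte string in a given encoding.
sub code_points {
    my ($encoding, $bytes) = @_;
    return join ' ', map { ord($_) } split //, decode($encoding, $bytes);
}

# Both lines print 945 946 947 948 949 950
print "cp737:  ", code_points('cp737',  $oem_bytes),  "\n";
print "cp1253: ", code_points('cp1253', $ansi_bytes), "\n";
```

Both sequences decode to the same six Greek letters, which shows that the console (OEM) and the argument list (ANSI) carry the same text in two different encodings.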

@Diomidis Spinellis. I am reverting your edits because `chcp` returns the OEM code page, but I believe Perl receives the arguments encoded using the ANSI code page. `chcp` will only work if they happen to be the same. – ikegami, Oct 19 '11 at 20:25
You're right regarding chcp, and I provided an example to illustrate it. There are many OEM code pages, and selecting the right one provides a solution. I modified the answer accordingly; I hope you agree. — Diomidis Spinellis, Oct 20 '11 at 06:08
@Diomidis Spinellis, I just had time for a very quick look, but if I'm correctly guessing what you're saying, this is very interesting and useful information. I'll read it in detail tomorrow. Thanks! — ikegami, Oct 20 '11 at 08:16

score 0 · Answer 2 · edited Oct 19 '11 at 16:45

This worked for me (on OS-X, but should be portable):

echo  αβγδεζ |perl -CI -e "chomp($in=<STDIN>);for (my $i = 0; $i < length($in); $i++) {print ord(substr($in, $i, 1)), qq{\n}; }"

That was for STDIN; for ARGV:

perl -CA -e "for (my $i = 0; $i < length($ARGV[0]); $i++) {print ord(substr($ARGV[0], $i, 1)), qq{\n}; }" αβγδεζ

See the -C option in perlrun: http://perldoc.perl.org/perlrun.html#Command-Switches
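What `-CA` does amounts to treating the argument bytes as UTF-8. A minimal sketch, using the byte values stivlo reported from the Linux terminal in the comments on the question (hard-coded here rather than read from @ARGV):

```perl
use strict;
use warnings;
use Encode qw( decode );

# UTF-8 bytes reported from the Linux terminal:
# 206 177 206 178 206 179 206 180 206 181 206 182
my $utf8_bytes = pack 'C*', 206, 177, 206, 178, 206, 179,
                            206, 180, 206, 181, 206, 182;

# Decoding turns 12 bytes into 6 characters.
my $text = decode('UTF-8', $utf8_bytes);

# Prints: 12 bytes, 6 characters
print length($utf8_bytes), " bytes, ", length($text), " characters\n";

# Prints: 945 946 947 948 949 950
print join(' ', map { ord($_) } split //, $text), "\n";
```

This only helps when the arguments really arrive as UTF-8, which is the case in a UTF-8 locale on Linux or OS X but not in a Windows cmd window.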

answered Oct 19 '11 at 16:34 · MisterEd

Sorry. I get `97 Malformed UTF-8 character (unexpected non-continuation byte 0x3f, immediately after start byte 0xdf) in ord at -e line 1. 0 100 101 63`. It is a Windows cmd issue; I'm sure it works fine on Linux, Mac OS X, etc. – Diomidis Spinellis Oct 19 '11 at 16:47
Interesting, I wonder if it has to do with the Windows command line. Does it work if the characters are read from a file and/or if the code is put into a script instead of `-e`? – MisterEd Oct 19 '11 at 16:52
Where do you run your script? From inside a cmd shell? If yes, have you tried starting cmd with the /U switch? – stivlo Oct 19 '11 at 16:54
That's not portable. That only works in UTF-8 locales. The OP definitely does not have a UTF-8 locale. – ikegami Oct 19 '11 at 17:54

score 0 · Answer 3 · answered Oct 19 '11 at 18:10

If I place the characters in a file (from OS-X), copy it to a Windows box (as file.txt), then run:

perl -CI -e "chomp($_=<STDIN>); map{print ord, qq{\n}} split(//)" < file.txt

Then I get the expected:

But if I copy the contents of file.txt to the command line, I get gibberish.

As @ikegami was saying, I don't think it's possible to do from command line since you don't have a UTF-8 locale.

score 0 · Answer 4 · edited Apr 06 '14 at 05:19

You could try using https://metacpan.org/pod/Win32::Unicode::Native. It should have what you need.

answered Oct 20 '11 at 09:41 · asdf000 · edited by Randal Schwartz
