I am doing an existence check for a file containing non-ASCII characters in its name, using Perl. Even though the file exists, the check always returns false. I am using Strawberry Perl v5.24.0 on a Windows 10 machine.

Here is my code:

use strict;
use warnings;

use Encode;

my $file = '<SOME DIR PATH> / áéíóú.mov';
if (-e $file) {
  print "$file exists";
} else {
  print "$file does not exist";
}

I also changed the code page in the cmd shell by running chcp 65001. cmd could then display the characters, but the script still reports that the file does not exist.

How can I fix this?

ThisSuitIsBlackNot
N. Pamnani
  • In what encoding did you save your script? If it's UTF-8, where's `use utf8;`? – choroba Jan 23 '17 at 14:58
  • You need to use your machine's "ANSI" encoding (cp1252? returned by `Win32::GetACP()`) for the builtin system calls. (chcp changes the terminal's OEM code page, which has nothing to do with the ANSI code page used by system calls.) Win32::Unicode provides functions that accept UTF-8 or Unicode strings (I can't remember which). Remember that `-e` is just a wrapper for `stat`. – ikegami Jan 23 '17 at 14:59
  • @ikegami If I encode the filename in cp1252 it should work, but I still get the same output. I just added the line `$file = encode("cp1252", $file);` and nothing changed. – N. Pamnani Jan 23 '17 at 15:19
  • @choroba Why would I need to use UTF-8? – N. Pamnani Jan 23 '17 at 15:20
  • 1) Did you check that cp1252 is the correct encoding? 2) You have at least one other problem: `$file` can't possibly contain the value you claim it does. Encode the source file using UTF-8 and add `use utf8;` to tell Perl you did. – ikegami Jan 23 '17 at 15:22
  • @ikegami I just checked, my source code is saved in UTF-8 and I added `use utf8;` but still it would return "Not Exists". Also I have encoded $file in cp1252 now. – N. Pamnani Jan 23 '17 at 15:25
  • What about my question? – ikegami Jan 23 '17 at 15:27
  • Also provide the output of: `perl -e"opendir($dh, '') or die $!; while (defined($_ = readdir($dh))) { printf(qq{%v02X\n}, $_); }"` – ikegami Jan 23 '17 at 15:30
  • @ikegami I checked for the encoding now, it returned `1252` – N. Pamnani Jan 23 '17 at 15:34
  • @ikegami This is the output `2E 2E.2E 63.6C.69.70.63.61.6E.76.61.73.5F.31.34.33.34.38.5F.50.72.6F.52.65.73.48.51.5F.50.41.4C.2E.6D.6F.76 E1.E9.ED.F3.FA.2E.6D.6F.76 2D.43.4F.50.59.7E.31.2E.4D.4F.56` – N. Pamnani Jan 23 '17 at 15:36
  • I had a similar issue and found that if I changed directory to the directory containing the file, and then tested for its existence without any path in front of its name, it worked fine. Never did get to the bottom of it. – Mark Setchell Jan 23 '17 at 15:51
  • The file name is indeed the cp1252 encoding of `áéíóú.mov`. So problem lies with `$file`. What's the output of `printf(qq{%v02X\n}, $file);` before you call `encode`. – ikegami Jan 23 '17 at 15:59
  • @Mark Setchell, Your problem was probably that you did `my $dir = <>; open(my $fh, '<', "$dir/foo.txt")` instead of `my $dir = <>; chomp($dir); open(my $fh, '<', "$dir/foo.txt")` – ikegami Jan 23 '17 at 16:02
  • @ikegami Here is the output before encoding - 44.3A.2F.57.6F.72.6B.2F.54.69.6D.65.49.6E.63.56.69.64.65.6F.46.78.2F.7A.6F.6F.6D.5F.76.69.64.65.6F.5F.69.6E.67.65.73.74.2F.31.30.5F.6E.65.73.74.65.64.5F.66.6F.6C.64.65.72.73.2F.66.31.2F.C3.A1.C3.A9.C3.AD.C3.B3.C3.BA.2E.6D.6F.76 – N. Pamnani Jan 23 '17 at 16:05
  • That's encoded using UTF-8. You didn't add `use utf8;` as instructed. – ikegami Jan 23 '17 at 16:26

1 Answer

use strict;
use warnings;

# Properly decode source code, which is expected to be UTF-8.
# This allows non-ASCII characters in the source.
use utf8;

# Properly decode text received from STDIN.
# Properly encode text sent to STDOUT and STDERR.
use Win32 qw( );
my ( $enc_input, $enc_output, $enc_syscall );
BEGIN {
   $enc_input   = 'cp'.Win32::GetConsoleCP();
   $enc_output  = 'cp'.Win32::GetConsoleOutputCP();
   $enc_syscall = 'cp'.Win32::GetACP();

   binmode STDIN,  ":encoding($enc_input)";
   binmode STDOUT, ":encoding($enc_output)";
   binmode STDERR, ":encoding($enc_output)";
}

use Encode qw( encode );

my $file = 'áéíóú.mov';

if (-e encode($enc_syscall, $file, Encode::FB_CROAK | Encode::LEAVE_SRC)) {
   print("$file exists\n");
}
elsif ($!{ENOENT}) {
   print("$file doesn't exist\n");
}
else {
   die("Can't determine if \"$file\" exists: $!\n");
}

or

use strict;
use warnings;

# Properly decode source code, which is expected to be UTF-8.
# This allows non-ASCII characters in the source.
use utf8;

# Properly decode text received from STDIN.
# Properly encode text sent to STDOUT and STDERR.
use Win32 qw( );
my ( $enc_input, $enc_output, $enc_syscall );
BEGIN {
   $enc_input   = 'cp'.Win32::GetConsoleCP();
   $enc_output  = 'cp'.Win32::GetConsoleOutputCP();
   $enc_syscall = 'cp'.Win32::GetACP();

   binmode STDIN,  ":encoding($enc_input)";
   binmode STDOUT, ":encoding($enc_output)";
   binmode STDERR, ":encoding($enc_output)";
}

use Win32::Unicode::File qw( statW );

my $file = 'áéíóú.mov';

if (statW($file)) {
   print("$file exists\n");
}
elsif ($!{ENOENT}) {
   print("$file doesn't exist\n");
}
else {
   die("Can't determine if \"$file\" exists: $^E\n");
}

The latter isn't limited to paths containing characters of the machine's ANSI charset.
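
To make that limitation concrete, here is a minimal sketch of what happens when a name is encoded for the ANSI system calls. It assumes the ANSI code page is cp1252, as reported in the question (on another machine, `Win32::GetACP()` would supply the number), and `ansi_bytes` is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: returns the bytes to hand to a builtin such as -e,
# or undef when the name has a character the code page can't represent.
sub ansi_bytes {
    my ($enc, $name) = @_;
    # FB_CROAK makes encode() die on an unmappable character;
    # LEAVE_SRC keeps encode() from modifying $name.
    return eval { encode($enc, $name, Encode::FB_CROAK() | Encode::LEAVE_SRC()) };
}

# The second name uses characters that cp1252 does not contain.
for my $name ("\x{E1}\x{E9}.mov", "\x{65E5}\x{672C}.mov") {
    my $bytes = ansi_bytes('cp1252', $name);
    printf "%vX => %s\n", $name,
        defined $bytes ? sprintf('%vX', $bytes) : 'not representable in cp1252';
}
```

The first name maps cleanly to cp1252 bytes, while the second cannot be encoded, so the `-e` approach has no way to name that file. That is the case where `statW` is still usable.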

ikegami
  • Thanks, it worked and I realized my blunders. Just out of curiosity, what did the outputs you asked for in the earlier comments show? And where can I go to learn more about this? – N. Pamnani Jan 24 '17 at 05:02
  • Also, when I put all the use statements in a BEGIN block, the above code again reports that the file does not exist. Why? – N. Pamnani Jan 24 '17 at 05:48
  • The outputs were the values (in hex) of the characters that make up the strings. As for the BEGIN block: `use utf8;` is lexically scoped, so it only applies inside the block that contains it. – ikegami Jan 24 '17 at 06:33
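
The scoping point in the last comment can be seen with a short sketch. It assumes the script file itself is saved as UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;

BEGIN { use utf8; }   # the pragma ends at this block's closing brace

my $outside = 'á';    # utf8 not in effect: the two source bytes C3 A1

my $inside;
{
    use utf8;         # in effect until the end of this block
    $inside = 'á';    # decoded to the single character U+00E1
}

# %vX prints the value of each character in hex, separated by dots.
printf "outside: %vX (length %d)\n", $outside, length $outside;
printf "inside:  %vX (length %d)\n", $inside,  length $inside;
```

The string assigned outside the block still holds the raw UTF-8 bytes, which is why encoding it to cp1252 afterwards produces a name that does not match the file on disk.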