How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

Question

For example, given an empty file テスト.txt, how would I make a copy called テスト.txt.copy?

My first crack at it managed to access the file and create the new filename, but the copy generated ãƒ†ã‚¹ãƒˆ.txt.copy.

Here was my first crack at it:

#!/usr/bin/env perl

use strict;
use warnings;

use English '-no_match_vars';
use File::Basename;
use Getopt::Long;

use File::Copy;
use Win32;

my (
    $output_relfilepath,
   ) = process_command_line();

open my $fh, '>', $output_relfilepath or die $!;
binmode $fh, ':utf8';
foreach my $short_basename ( glob( '*.txt') ) {

  # skip the output basename if it's in the glob
  if ( $short_basename eq $output_relfilepath ) {
    next;
  }

  my $long_basename = Win32::GetLongPathName( $short_basename );
  my $new_basename  = $long_basename . '.copy';

  print {$fh} sprintf(
                      "short_basename = (%s)\n" .
                      " long_basename = (%s)\n" .
                      "  new_basename = (%s)\n",
                      $short_basename,
                      $long_basename,
                      $new_basename,
                     );
  copy( $short_basename, $new_basename );
}

printf(
       "\n%s done! (%d seconds elapsed)\n",
       basename( $0 ),
       time() - $BASETIME,
      );

# === subroutines ===

sub process_command_line {

  # default arguments
  my %args
    = (
       output_relfilepath => 'output.txt',
      );

  GetOptions(
             'help'                 => sub { print usage(); exit },
             'output_relfilepath=s' => \$args{output_relfilepath},
            );

  return (
          $args{output_relfilepath},
         );
}

sub usage {
  my $script_name = basename $0;

  my $usage = <<END_USAGE;
======================================================================

Test script to copy files with a UTF-8 filenames to files with
different UTF-8 filenames.  This example tries to make copies of all
.txt files with versions that end in .txt.copy.

  usage: ${script_name} (<options>)

options:

  -output_relfilepath <s>   set the output relative file path to <s>.
                            this file contains the short, long, and
                            new basenames.
                            (default: 'output.txt')

----------------------------------------------------------------------

examples:

  ${script_name}

======================================================================
END_USAGE

  return $usage;
}

Here are the contents of output.txt after execution:

short_basename = (BD9A~1.TXT)
 long_basename = (テスト.txt)
  new_basename = (テスト.txt.copy)

I've tried replacing File::Copy's copy command with a system call:

my $cmd = "copy \"${short_basename}\" \"${new_basename}\"";
print `$cmd`;

and with Win32::CopyFile:

Win32::CopyFile( $short_basename, $new_basename, 'true' );

Unfortunately, I get the same result in both cases (ãƒ†ã‚¹ãƒˆ.txt.copy). For the system call, the print shows 1 file(s) copied. as expected.

Notes:

I'm running Perl 5.10.0 via Strawberry Perl on Windows 7 Professional
I use the Win32 module to access long filenames
The glob returns short filenames, which I have to use to access the file
テスト = test (tesuto) in katakana
I've read perlunitut and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

What's the default encoding of your Windows? EUC-JP? Shift_JIS? — Mike, Feb 21 '10 at 01:44
See also http://stackoverflow.com/questions/2184726/how-do-i-create-a-unicode-directory-on-windows-using-perl — Sinan Ünür, Feb 21 '10 at 02:43
Mike: From general reading and using binmode ':utf8' I think my default encoding is utf-8, but I'm not 100% sure. Sinan: Thanks for the link! — vlee, Feb 21 '10 at 05:20
@vleeshue: Read the post Sinan pointed you to - I have managed to read/write files/directories using Unicode characters in their names on Windows by using his approach. — Nele Kosog, Feb 23 '10 at 10:15

score 3 · Answer 1 · answered Feb 21 '10 at 02:23

3

This should be possible with the CopyFileW function from Win32API::File, which should be included with Strawberry. I've never messed with Unicode filenames myself, so I'm not sure of the details. You might need to use Encode to manually convert the filename to UTF-16LE (encode('UTF16-LE', $filename)).

answered Feb 21 '10 at 02:23

cjm

61,471
9
126
175

That looks good. `CopyFileW` is certainly the underlying system call you'd need to use to do this; annoying it's not part of the `Win32` module. – bobince Feb 21 '10 at 13:46

score 2 · Answer 2 · answered Feb 21 '10 at 00:45

2

You're getting the long filename using Win32, which gives you a UTF-8-encoded string.

However, you're then setting the long filename using plain copy, which uses the C stdlib IO functions. The stdlib functions use the default filesystem encoding.

On modern Linuxes that's usually UTF-8, but on Windows it (sadly) never is, because the system default code page cannot be set to UTF-8. So you'll get your UTF-8 string interpreted as a code page 1252 string on a Western European Windows install, as has happened here. (On a Japanese machine it'd get interpreted as code page 932 — like Shift-JIS — which would come out something like 繝�せ繝�.)

I've not done this in Perl, but I'd suspect the Win32::CopyFile function would be more likely to be able to handle the kind of Unicode paths returned elsewhere in the Win32 module.

answered Feb 21 '10 at 00:45

bobince

528,062
107
651
834

Thanks for the info. I also tried both a standard windows copy (system call) and Win32::CopyFile to no avail (updated the question post with new findings). I'm just (naively?) surprised at how difficult this is :( – vlee Feb 21 '10 at 00:53
Oh dear. If even the `Win32` interface won't accept Unicode filenames you may be pretty much scuppered. Yes, I'm afraid the combination of native-Unicode Windows and byte-string C stdlib is very uncomfortable thanks to Windows's refusal to standardise on UTF-8 for the encoding. It's not possible to handle Unicode filenames from a stdlib-only interface like Perl's core uses. :-( It was impossible on Python too, until special support was added to use the native Windows interfaces. Sorry! – bobince Feb 21 '10 at 01:17

score 1 · Answer 3 · answered May 04 '12 at 07:56

1

Use Encode::Locale:

use Encode::Locale;
use Encode;
use File::Copy;

copy( encode(locale_fs => $short_basename),
      encode(locale_fs => $new_basename) ) || die $!;

answered May 04 '12 at 07:56

godegisel

37
2

score 0 · Answer 4 · answered Feb 23 '10 at 02:26

I successfully duplicated your problem on my Windows machine (Win XP Simplified Chinese version) and my conclusion is that the problem is caused by the font. Choose a Truetype font rather than Raster fonts and see if everything is okay.

My experiment is this:

I first changed the code page of my Windows Console from the default 936 (GBK) to 65001 (UTF-8). by typing C:>chcp 65001
I wrote a scrip that contains the code: $a= "テスト"; print $a; and saved it as UTF-8.
I ran the script from the Console and found "テスト" became "ãƒ†ã‚¹ãƒˆ", which is exactly the same sympton you described in your question.
I changed the Console Font from Raster Fonts to Lucida Console, the console screen gave me this: "テストストトト", which is still not quite right but I assume it is getting closer to the core of the problem.

So althought I'm not 100% sure but the problem is probably caused by the font.

Hope this helps.

score 0 · Answer 5 · answered Jan 17 '15 at 03:28

See https://metacpan.org/pod/Win32::Unicode

#!/usr/bin/perl --
use utf8;
use strict;
use warnings;

my @kebabs = (
  "\x{45B}\x{435}\x{432}\x{430}\x{43F}.txt",               ## ћевап.txt
  "ra\x{17E}nji\x{107}.txt",                               ## ražnjić.txt
  "\x{107}evap.txt",                                       ## ćevap.txt
  "\x{43A}\x{435}\x{431}\x{430}\x{43F}\x{447}\x{435}.txt", ## кебапче.txt
  "kebab.txt",
);

{
    use Win32::Unicode qw/ -native /;
    printW "I \x{2665} Perl"; # unicode console out
    mkpathW 'meat';
    chdirW 'meat';
    for my $kebab ( @kebabs ){
        printW "kebabing the $kebab\n";
        open my($fh), '>:raw', $kebab or dieW Fudge($kebab);
        print $fh $kebab              or dieW Fudge($kebab);
        close $fh                     or dieW Fudge($kebab);
    }
}

sub Fudge {
    use Errno();
    join qq/\n/,
      "Error @_",
      map { "  $_" } int( $! ) . q/ / . $!,
      int( $^E ) . q/ / . $^E,
      grep( { $!{$_} } keys %! ),
      q/ /;
}

How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

5 Answers5

Linked