Perl - Convert PC UTF-8 to PC ANSI

Question

I have a file that is encoded PC UTF-8. I would like to convert the file into PC ANSI.

I have tried the below, but I always get the output file to be PC UTF-8.

use Encode;

$infile = $ARGV[0];
open(INFILE, $infile);

my $outfile = "temp.txt";

open(OUTFILE, ">$outfile");

while(<INFILE>) {
  my $row = $_;
  chomp $row;

  $row = Encode::encode("Windows-1252", $row);
  print OUTFILE $row."\n";

}

close INFILE;
close OUTFILE;

It's *slightly* wasteful to `chomp` the line and then append `\n` to it. — Keith Thompson, Feb 25 '13 at 21:58
Can you try it with a *very* small file, say 1 short line with a single non-ASCII character, and show us a hex dump of the input and the output? — Keith Thompson, Feb 25 '13 at 22:07
And this isn't relevant to your problem, but the 3-argument version of `open` is preferred. http://modernperlbooks.com/mt/2010/04/three-arg-open-migrating-to-modern-perl.html — Keith Thompson, Feb 25 '13 at 22:08

ikegami · Accepted Answer · 2013-02-25T23:30:15.760

10

The problem is that you never decode the data you encode.

use strict;
use warnings;
use Encode qw( encode decode );

open(my $INFILE,  '<', $ARGV[0]) or die $!;
open(my $OUTFILE, '>', $ARGV[1]) or die $!;

while (my $utf8 = <$INFILE>) {
   my $code_points = decode('UTF-8', $utf8);    # <-- This was missing.
   my $cp1252 = encode('cp1252', $code_points);
   print $OUTFILE $cp1252;
}

But you can do this a bit more easily:

use strict;
use warnings;

open(my $INFILE,  '<:encoding(UTF-8)',  $ARGV[0]) or die $!;
open(my $OUTFILE, '>:encoding(cp1252)', $ARGV[1]) or die $!;

while (<$INFILE>) {
   print $OUTFILE $_;
}
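
To see why the missing `decode` leaves the output looking like UTF-8, it may help to look at the bytes for a single character. This is only a small sketch on an in-memory string (the variable names are made up for illustration), not a replacement for the scripts above:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# "é" as it sits in a UTF-8 file: two bytes, 0xC3 0xA9.
my $utf8_bytes = "\xC3\xA9";

# Without decode: Perl treats the two bytes as the two characters
# U+00C3 and U+00A9, and cp1252 maps both back to the same byte values.
my $undecoded = encode('cp1252', $utf8_bytes);

# With decode: the two bytes become the single character U+00E9,
# which cp1252 stores as the single byte 0xE9.
my $converted = encode('cp1252', decode('UTF-8', $utf8_bytes));

# Helper for this sketch only: render a byte string as hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

print "without decode: ", hex_dump($undecoded), "\n";   # C3 A9
print "with decode:    ", hex_dump($converted), "\n";   # E9
```

In the first case the output bytes are identical to the input bytes, which is why the output file was still detected as UTF-8.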

edited Feb 25 '13 at 23:30

answered Feb 25 '13 at 23:24


(`cp1252` is just a shorter way of writing `Windows-1252`) – ikegami Feb 25 '13 at 23:30

This seems to work. I just get a message saying `"\x{feff}" does not map to cp1252`. Any nice way of filtering these out? – user333746 Feb 26 '13 at 01:10

If that's the only problem character, you can safely get rid of it using `s/^\x{FEFF}//;` (after decoding). It's the [BOM](http://en.wikipedia.org/wiki/Byte_order_mark). – ikegami Feb 26 '13 at 03:22
Transcoding and replacing some of the contents is not such a rare scenario, for example if you are working on files that include the encoding in their metadata, such as HTML. – Wolf Jul 07 '17 at 11:17
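
For reference, the BOM removal suggested in the comments could be folded into the layer-based version roughly as follows. This is a sketch under assumptions: `convert_file` is a hypothetical helper name, and the BOM is only looked for at the start of the first line.

```perl
use strict;
use warnings;

# convert_file is a hypothetical helper name, not part of any module.
sub convert_file {
   my ($in_path, $out_path) = @_;

   open(my $INFILE,  '<:encoding(UTF-8)',  $in_path)  or die "$in_path: $!";
   open(my $OUTFILE, '>:encoding(cp1252)', $out_path) or die "$out_path: $!";

   while (my $line = <$INFILE>) {
      # The layer has already decoded the line, so the BOM is the single
      # character U+FEFF. $. is the line number of the last handle read.
      $line =~ s/^\x{FEFF}// if $. == 1;
      print $OUTFILE $line;
   }

   close($OUTFILE) or die "$out_path: $!";
   close($INFILE);
}

# Assumed usage: perl script.pl input.txt output.txt
convert_file(@ARGV) if @ARGV == 2;
```

Characters other than the BOM that cp1252 cannot represent would still produce the "does not map" warning, which is probably what you want to be told about.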

amon · Answer 2 · 2013-02-25T22:08:49.120

1

Instead of decoding and encoding manually, you should use PerlIO layers. You can specify a layer with the `binmode` function, or in the mode argument to three-argument `open`:

use strict; use warnings;
use autodie;

open my $INFILE,  '<:utf8',                 $ARGV[0];
open my $OUTFILE, '>:encoding(iso-8859-1)', "temp.txt";
#                   ^-- the layers

while (my $line = <$INFILE>) {
  print $OUTFILE $line;
}

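Note that Perl doesn't decode files as UTF-8 by default, so you have to specify the decoding layer for the input file as well. The shorthand :utf8 is commonly used in place of :encoding(UTF-8), but the two are not identical: :utf8 does not validate the input, so :encoding(UTF-8) is the safer choice for data you don't control.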

You can list all available encodings with

use Encode;
# ":all" also lists encodings that are installed but not yet loaded.
print "$_\n" for Encode->encodings(":all");
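
If you only want to know whether one particular name is usable, and what Encode calls it internally, something like the following sketch may be enough. The three names in the list are just examples taken from this thread.

```perl
use strict;
use warnings;
use Encode;

for my $name ('Windows-1252', 'cp1252', 'iso-8859-1') {
   # find_encoding returns an encoding object, or undef if the name is unknown.
   my $enc = Encode::find_encoding($name);
   printf "%-14s => %s\n", $name, $enc ? $enc->name : '(not available)';
}
```

The first two names should report the same canonical name, since one is an alias for the other.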

edited Feb 25 '13 at 22:08

answered Feb 25 '13 at 22:02


@user333746 ① Check the list of available encodings, to see what you have currently installed. ② Please compare your code to my updated post; The layer is `:encoding(foo-bar)` (my initial post had a mistake). ③ What version of perl are you running? Why an `eval` – are you under mod_perl? – amon Feb 25 '13 at 22:39

Why did you change from Windows-1252 to iso-8859-1? They're not the same, and the OP clearly said he wanted the "ANSI" encoding (which is what Windows calls its single-byte local encoding, which is Windows-1252 aka cp1252 on most machines, never iso-8859-1). – ikegami Feb 25 '13 at 23:26
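
A quick way to see that the two encodings differ is to try a character that only one of them contains. The sketch below uses the euro sign as an example and asks Encode to die rather than substitute a replacement character; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $euro = "\x{20AC}";   # EURO SIGN

# FB_CROAK makes encode die on unmappable characters;
# LEAVE_SRC keeps encode from modifying $euro in place.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# cp1252 has the euro sign at byte 0x80.
my $cp1252 = encode('cp1252', $euro, $check);
printf "cp1252:     %02X\n", ord($cp1252);

# iso-8859-1 has no euro sign, so this call dies and eval returns undef.
my $latin1 = eval { encode('iso-8859-1', $euro, $check) };
print "iso-8859-1: ", (defined $latin1 ? "encoded\n" : "cannot encode: $@");
```

Any text containing characters from the 0x80 to 0x9F range of cp1252 (curly quotes, dashes, the euro sign) would hit the same problem with an iso-8859-1 output layer.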
