I need to make sure that the output file I create with my Perl script is encoded in CP1252 and not UTF-8, because it will be used within a UNIX SQL*Plus framework that does not handle German umlauts correctly when it inserts those values into the database columns. (I use Strawberry Perl v5.18 on Windows 10, and I cannot set NLS_LANG or chcp within the UNIX SQL environment.)

With this little test script I can reproduce that the output file "testfile1.txt" is always in UTF-8, while "testfile2.txt" is CP1252 as expected. How can I force the output for "testfile1.txt" to also be CP1252, even if there are no "special" characters in the text?

#!/usr/bin/env perl -w
use strict;
use Encode;

# the result file under Windows 10 will have UTF-8 codeset
open(OUT,'> testfile1.txt');    
binmode(OUT,"encoding(cp-1252)");
print OUT encode('cp-1252',"this is a test");
close(OUT);

# the result file under Windows 10 will have Windows-cp1252 codeset
open(OUT,'> testfile2.txt');    
binmode(OUT,"encoding(cp-1252)");
print OUT encode('cp-1252',"this is a test with german umlauts <ÄäÜüÖöß>");
close(OUT);
drvolk

1 Answer

I think your question is based on a misunderstanding. testfile1.txt contains the text "this is a test". These characters are encoded as the same bytes in ASCII, Latin-1, UTF-8, and CP-1252, so testfile1.txt is valid in all of these encodings simultaneously. A tool that reports it as UTF-8 is only guessing; the bytes would be identical in a CP-1252 file.
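To make this concrete, here is a minimal sketch that encodes the ASCII-only string in all four encodings and compares the resulting bytes. The helper `hex_bytes` is a hypothetical name introduced for illustration, and the umlaut is written as an escape so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex, e.g. "C3 A4".
# (hex_bytes is an illustrative helper, not part of Encode.)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my @encodings = ('ascii', 'iso-8859-1', 'UTF-8', 'cp1252');

# Pure ASCII text: all four encodings produce the same bytes.
my $plain = "this is a test";
my %plain_bytes;
for my $enc (@encodings) {
    $plain_bytes{$enc} = encode($enc, $plain);
}
my $all_same = !grep { $plain_bytes{$_} ne $plain_bytes{ascii} } @encodings;
print "ASCII-only text identical in all encodings: ",
    ($all_same ? "yes" : "no"), "\n";

# "ä" (U+00E4) is where the encodings start to differ:
# cp1252 stores it as the single byte E4, UTF-8 as the two bytes C3 A4.
my $umlaut = "\x{E4}";
for my $enc ('cp1252', 'UTF-8') {
    printf "%-7s %s\n", $enc, hex_bytes(encode($enc, $umlaut));
}
```

Only the second part shows a difference, which is why a file without umlauts cannot be told apart by its content.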


To include literal Unicode characters in your source code like this:

print OUT encode('cp-1252',"this is a test with german umlauts <ÄäÜüÖöß>");

you need

use utf8;

at the top.

Also, don't combine encoding layers on filehandles with explicit encode() calls. Either set an encoding layer and print Unicode text to it, or use binmode(OUT) and print raw bytes (as returned from encode()) to it.
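As a rough sketch of the two options, the following writes the same text once through an encoding layer and once as pre-encoded bytes through a raw handle. It uses in-memory filehandles (a reference to a scalar instead of a filename) purely so the result can be inspected without touching the disk; the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "<Ääß>" as Unicode characters, written with escapes so the
# encoding of this script file does not matter.
my $text = "<\x{C4}\x{E4}\x{DF}>";

# Option 1: put an encoding layer on the handle and print Unicode text.
my $via_layer = '';
open(my $fh1, '>:encoding(cp1252)', \$via_layer) or die "open: $!";
print $fh1 $text;
close($fh1);

# Option 2: use a raw (binary) handle and print the bytes from encode().
my $via_encode = '';
open(my $fh2, '>:raw', \$via_encode) or die "open: $!";
print $fh2 encode('cp1252', $text);
close($fh2);

# Both approaches should yield the same five bytes: 3C C4 E4 DF 3E.
print "same bytes: ", ($via_layer eq $via_encode ? "yes" : "no"), "\n";
print join(' ', map { sprintf '%02X', ord($_) } split //, $via_layer), "\n";
```

Either option is fine on its own; the point is to pick one so that the text is encoded exactly once.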


By the way, you shouldn't use -w anymore. It's been supplanted by the

use warnings;

pragma.

Similarly, bareword filehandles and two-argument open are pre-5.6 style code and shouldn't be used in code written after 2000. (perl 5.005 and earlier didn't support Unicode/encodings anyway.)

A fixed version of your code looks like this:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

{
    open(my $out, '>:encoding(cp-1252)', 'testfile1.txt') or die "$0: testfile1.txt: $!\n";    
    print $out "this is a test\n";
    close($out);
}

{
    open(my $out, '>:encoding(cp-1252)', 'testfile2.txt') or die "$0: testfile2.txt: $!\n";
    print $out "this is a test with german umlauts <ÄäÜüÖöß>\n";
    close($out);
}
melpomene
  • 1
    _Schei� Encoding..._ :) – simbabque Nov 01 '17 at 15:34
  • Thanks a lot for the fast and great answer. So the "encoding type" is not stored as a "key" within the file; it is just determined by the binary values stored in it, right? My misunderstanding came from the fact that my editor gives me the option to save a file with a selected codeset. By the way, what is the reason for enclosing the two "open" statements in braces? – drvolk Nov 02 '17 at 07:00
  • @drvolk Correct. Text files don't store their own encoding. Programs have to just know somehow what the encoding is (or take a guess based on the bytes in the file, but that can go wrong). The blocks in my code are to limit the scope of the `$out` handle. – melpomene Nov 02 '17 at 09:13
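
To illustrate why the reader has to know the encoding, here is a small sketch that decodes the same two bytes under two different assumptions. The byte values are chosen for this example (they are the UTF-8 encoding of "ä"); nothing in the bytes themselves says which interpretation is the intended one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes with no label attached: C3 A4.
my $bytes = "\xC3\xA4";

# Format a character string as a list of code points.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

# Read as UTF-8, the bytes are one character: U+00E4 (ä).
my $as_utf8 = decode('UTF-8', $bytes);

# Read as CP1252, the same bytes are two characters: U+00C3 U+00A4 (Ã¤).
my $as_cp1252 = decode('cp1252', $bytes);

print "as UTF-8:  ", code_points($as_utf8),   "\n";
print "as CP1252: ", code_points($as_cp1252), "\n";
```

Both decodings succeed without error, so a program that guesses wrong produces garbled text rather than a failure.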