
I need to produce some UTF-16LE encoded files with CRLF line separators on a Windows 7 box (currently with Strawberry Perl 5.20.1).

It took me a long time of fiddling to get correct output, and I wonder whether my solution is the right way to do it, because it seems overcomplicated compared with other languages. In particular:

  • Why does Perl produce valid big-endian UTF-16 with a correct BOM for encoding(UTF-16), but no BOM for either UTF-16LE or UTF-16BE unless I use the additional package File::BOM?
  • Why does the out-of-the-box CRLF handling seem buggy (it is output as 0D 0A 00 instead of 0D 00 0A 00) unless I fiddle with the layers? I doubt it could be a true bug in a language with so many users...

Here are my attempts, with comments. The ones I found to be correct are the last statements:

use strict;
use warnings;
use utf8;
use File::BOM;
use feature 'say';

my $UTF;
my $data = "Hello, héhé, 中文.\nsecond line : my 2€"; # 中文 = zhong wen = chinese

# UTF16 BE + BOM but incorrect CRLF: "0D 0A 00" instead of "0D 00 0A 00"
open $UTF, ">:encoding(UTF-16)", "utf-16-std-be.txt" or die $!;
say $UTF $data;
close $UTF;

# same as UTF-16BE (no BOM, incorrect CRLF)
open $UTF, ">:encoding(ucs2)", "utf-ucs2.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 BE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16BE)", "utf-16-be-nobom.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 LE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16LE)", "utf-16-le-nobom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 LE, BOM OK but still incorrect CRLF
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 LE non raw incorrect 
# (crlf by default on windows) -> 0A => 0D 0A
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf2.txt" or die $!;
print $UTF $data, "\x0a"; # 0A is magically expanded to 0D 0A but wrong
close $UTF;

# UTF16 LE + BOM + LF 
# raw -> 0A => 0A
# could be correct on UNIX but I need CRLF
open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf3.txt" or die $!;
say $UTF $data;
close $UTF;

# manual BOM, but CRLF OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf", "utf-16-le-bommanual-crlfok.txt" or die $!;
print $UTF "\x{FEFF}";
say $UTF $data;
close $UTF;

#auto BOM, CRLF OK ?
#incorrect, says utf8 "\xA9" does not map to Unicode at c:/perl/Dwimperl-5.14/perl/lib/Encode.pm line 176.
# But I cannot see where the A9 comes from ??!
#~ open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM):crlf", "utf-16-le-autobom-crlfok1.txt" or die $!;
#~ print $UTF $data;
#~ say $UTF $data;
#~ close $UTF;

# WTF? \n becomes 0D 00 0D 0A 00
open $UTF, ">:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlf2.txt" or die $!;
say $UTF $data;
close $UTF;

#CORRECT WAY?? : Automatic BOM, CRLF is OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlfok3.txt" or die $!;
say $UTF $data;
close $UTF;
Seki
  • The method you have tagged with *CORRECT WAY?* is the recommended approach, but as I commented on ikegami's answer, my preference is to print the BOM explicitly immediately after opening the file. Rather than relying on the "magic number" FEFF you can specify the character by Unicode name, as in `print $UTF "\N{BOM}"` – Borodin Aug 14 '15 at 17:13

1 Answer


manual BOM, but CRLF OK

Yes, the following is indeed correct:

:raw:encoding(UTF-16LE):crlf + manual BOM
  • :raw "clears" the existing :crlf and :encoding layers.
  • :encoding converts between bytes and Code Points.
  • :crlf converts between CRLF and LF.

So,

                               Read
        ===================================================>

                               Code                 Code
+------+   bytes   +------+   Points   +-------+   Points   +------+
| File |-----------| :enc |------------| :crlf |------------| Code |
+------+           +------+    CRLF    +-------+     LF     +------+ 

        <===================================================
                               Write

You want the CRLF⇔LF conversion to be performed on the Code Points (not the bytes), which is what this setup does.
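To make the difference concrete, here is a minimal sketch that writes the same text through two layer orders and dumps the resulting bytes. The helper name `hex_bytes_written` and the use of a temporary file are illustrative assumptions, not part of the original question; the first call places `:crlf` below `:encoding` to mimic the default Windows stack.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layer string and
# return the bytes that ended up on disk as a hex dump ("61 00 0D ...").
sub hex_bytes_written {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, ">$layers", $path or die "open for write: $!";
    print {$out} $text;
    close $out or die "close: $!";

    # Read the file back without any translation.
    open my $in, '<:raw', $path or die "open for read: $!";
    local $/;                                   # slurp mode
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# :crlf below :encoding: the CR is inserted as a single byte
# after the text has already been encoded.
my $byte_level = hex_bytes_written(':crlf:encoding(UTF-16LE)', "a\n");

# :crlf above :encoding: the CR is inserted as a code point and
# then encoded. The BOM is printed manually.
my $char_level = hex_bytes_written(':raw:encoding(UTF-16LE):crlf', "\N{BOM}a\n");

print "$byte_level\n";   # 61 00 0D 0A 00
print "$char_level\n";   # FF FE 61 00 0D 00 0A 00
```

The first dump shows the 0D 0A 00 sequence from the question; the second shows the intended 0D 00 0A 00 preceded by the FF FE byte order mark.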


CORRECT WAY?? : Automatic BOM, CRLF is OK

While :raw:encoding(UTF-16LE):crlf:via(File::BOM) may work for a write handle, it doesn't look right (I would have expected :raw:via(File::BOM,UTF-16LE):crlf), and it fails miserably for a read handle (at least for me with Perl 5.16.3).

I just looked, and the code behind :via(File::BOM) does some very questionable things. I wouldn't use it.
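For the read side, a sketch that avoids File::BOM could use the same `:raw:encoding(UTF-16LE):crlf` stack and strip a leading BOM by hand. The function name `read_utf16le_lines` and the sample file are assumptions made for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-16LE file with CRLF line endings and
# return its lines, without line endings and without a leading BOM.
sub read_utf16le_lines {
    my ($path) = @_;
    open my $in, '<:raw:encoding(UTF-16LE):crlf', $path or die "open: $!";
    my @lines = <$in>;
    close $in;
    chomp @lines;
    $lines[0] =~ s/\A\x{FEFF}// if @lines;      # drop the BOM if present
    return @lines;
}

# Usage: create a sample file with the matching write-side stack.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;
open my $out, '>:raw:encoding(UTF-16LE):crlf', $path or die "open: $!";
print {$out} "\N{BOM}first\nsecond\n";
close $out or die "close: $!";

my @lines = read_utf16le_lines($path);
print scalar(@lines), " lines\n";               # 2 lines
```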


why Perl is making a valid UTF-16 big-endian with correct BOM with encoding(UTF-16) while there is no BOM if I use either UTF-16LE or UTF-16BE without using an additional package File::BOM

Because you might not want a BOM.
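The difference can be seen without any file handles by calling `Encode::encode` directly. This is a small sketch; `hex_dump` is a hypothetical helper introduced only for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a byte string.
sub hex_dump { return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# Generic UTF-16: Encode picks big-endian and prepends a BOM.
my $generic  = hex_dump(encode('UTF-16', 'a'));

# Explicit endianness: no BOM is added.
my $le_plain = hex_dump(encode('UTF-16LE', 'a'));

# Explicit endianness with U+FEFF written as ordinary data.
my $le_bom   = hex_dump(encode('UTF-16LE', "\x{FEFF}a"));

print "$generic\n";    # FE FF 00 61
print "$le_plain\n";   # 61 00
print "$le_bom\n";     # FF FE 61 00
```

With an explicit byte order, the BOM is just another character that the program chooses to write or not.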

why out-of-the-box the CRLF handling seems buggy

Adding layers adds them at the end of the list. If you want to add a layer elsewhere (as is the case here), you need to rebuild the list.
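The layer list can be inspected with `PerlIO::get_layers`. The sketch below uses a hypothetical helper, `layer_names`, that applies a series of `binmode` calls to a temporary file handle and returns the resulting list; the initial `:crlf` is there only to imitate the Windows default on other platforms. The exact names printed depend on the platform, so no output is shown.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: apply each layer spec with binmode, in order,
# and return the names of the layers now on the handle.
sub layer_names {
    my @specs = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    for my $spec (@specs) {
        binmode $fh, $spec or die "binmode $spec: $!";
    }
    my @names = PerlIO::get_layers($fh);
    close $fh;
    return @names;
}

# Appending: :encoding lands on top of the existing :crlf,
# so CRLF translation happens on the encoded bytes.
my @appended = layer_names(':crlf', ':encoding(UTF-16LE)');

# Rebuilding: :raw clears :crlf and :encoding, then both are
# pushed again in the wanted order.
my @rebuilt = layer_names(':crlf', ':encoding(UTF-16LE)',
                          ':raw:encoding(UTF-16LE):crlf');

print "appended: @appended\n";
print "rebuilt:  @rebuilt\n";
```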

It was suggested on the development list for Perl that there should be a way of distinguishing between byte layers (e.g. :unix) and text layers (e.g. :crlf), and that adding a byte or encoding layer should dig down and place it at the appropriate spot. But no one has acted on this yet.

In addition to simplifying your code, it would allow a UTF-16*[1] encoding layer to be added to STDIN/STDOUT/STDERR (or other existing handles). I believe that's not currently possible.


  1. Technically, any encoding for which CR != 13 or LF != 10 has this problem, so EBCDIC is also affected.
ikegami
  • As I mentioned in the blog post referenced in [my answer](http://www.nu42.com/2014/05/why-is-perliofcrlf-set-on-bottom-most.html), it seems to me `PERLIO_F_CRLF` should ***not*** be set on the bottom-most 'unix' layer on Windows. At least, that's how I read [perliol](http://perldoc.perl.org/perliol.html#Core-Layers): "*Even on platforms that distinguish between `O_TEXT` and `O_BINARY` this layer is always `O_BINARY`.*" – Sinan Ünür Aug 14 '15 at 16:34
  • @Sinan Ünür, `:unix` does use `O_BINARY` as perliol documents. And not adding `:crlf` will make the Perl output "unreadable" to a lot of programs. Try `perl -e"binmode STDOUT; print qq{a\nb\nc\n}" >a.txt && notepad a.txt` It will also mess up virtually all Perl code that reads text files. – ikegami Aug 14 '15 at 16:41
  • Note that a "manual BOM" can be printed as `"\N{BOM}"`, which I think is cleaner than adding `use File::BOM` and `:via(File::BOM)` to your program. If you use an open mode of just `UTF-16` then `Encode` will add a BOM for you but assumes BE encoding. There really should be a way of influencing that assumption. It would also be nice if the `open` pragma allowed multiple layers to be specified. – Borodin Aug 14 '15 at 17:06
  • So that something like `:encoding(UTF-16LE):via(File::BOM):crlf` could be specified – Borodin Aug 14 '15 at 17:15
  • @Borodin, Should just be `:via(File::BOM,UTF-16le):crlf` (i.e. `UTF-16le` should be parameter to the BOM handler), though the syntax is probably not currently supported by `:via`. – ikegami Aug 14 '15 at 17:16