
I want to write a CSV file encoded in UTF-16LE. However, the output in the file gets messed up: there are strange Chinese-looking characters such as ਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ.

This looks like the off-by-one-byte problem mentioned here: Creating UTF-16 newline characters in Python for Windows Notepad

Other threads about Perl and Text::CSV_XS didn't help.

This is how I try it:

#!perl

use strict;
use warnings;
use utf8;
use Text::CSV_XS;

binmode STDOUT, ":utf8";

my $csv = Text::CSV_XS->new({
    binary => 1,
    sep_char => ";",
    quote_char => undef,
    eol => $/,
});

open my $in, '<:encoding(UTF-16LE)', 'in.csv' or die "in.csv: $!";
open my $out, '>:encoding(UTF-16LE)', 'out.csv' or die "out.csv: $!";

while (my $row = $csv->getline($in)) {
    $_ =~ s/ä/æ/ for @$row; # something will be done to the data...
    $csv->print($out, $row);
}


close $in;
close $out;

in.csv contains some test data and it is encoded in UTF-16LE:

header1;header2;
cell1.1;cell1.2;
äöü2.1;ab"c2.2;

The result looks like this:

header1;header2;਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ
æöü2.1;abc2.2;਍

It is not an option to switch to UTF-8 as the output format (which works fine, by the way).

So, how do I write valid UTF-16LE encoded CSV files using Text::CSV_XS?

  • Can you create UTF-8 and then use Encode or Encode::Unicode to transcode it to UTF-16LE? – DavidO Nov 05 '14 at 18:16
  • Indeed, this was a workaround I thought about. The file content is not UTF-16, another program simply expects UTF-16. But I don't like it, because it's a workaround. I fear that I'm missing something (trivial?), as I assume that Perl modules - especially those having to do with IO stuff - should be able to handle UTF-16 etc. – capfan Nov 05 '14 at 18:22
  • I don't know the answer to this: would Text::CSV (not XS) be able to handle UTF-16LE? It wouldn't surprise me if the XS module didn't handle yet another Unicode encoding. – DavidO Nov 05 '14 at 18:24
  • I tried Text::CSV instead of Text::CSV_XS and verified using $csv->is_pp, but there was no change. The output is still messed up the same way as with the XS module. – capfan Nov 05 '14 at 18:29

1 Answer


Perl adds :crlf by default on Windows. It's added first, before your :encoding is added.

That means LF⇔CRLF conversion will be performed before decoding on reads, and after encoding on writes. This is backwards.
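
You can verify the layer ordering by dumping a handle's layer stack with PerlIO::get_layers (a quick check; the exact names vary by platform and Perl build, but on Windows you would typically see something like this):

open my $fh, '<:encoding(UTF-16LE)', 'in.csv' or die "in.csv: $!";
print join(' ', PerlIO::get_layers($fh)), "\n";
# On Windows, typically: unix crlf encoding(UTF-16LE) utf8
# :crlf sits below :encoding, so on reads it rewrites raw bytes before
# they are decoded, and on writes it runs after the bytes are encoded.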

It ends up working with UTF-8 despite being done backwards because all of the following conditions are met:

  • The UTF-8 encoding of LF is the same as its Code Point (0A).
  • The UTF-8 encoding of CR is the same as its Code Point (0D).
  • 0A always refers to LF no matter where they are in the file.
  • 0D always refers to CR no matter where they are in the file.

None of those conditions holds true for UTF-16LE.
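
To see the damage concretely, here is a small simulation, independent of Text::CSV_XS, of what a byte-level LF-to-CRLF translation does to UTF-16LE output; it reproduces the exact garbage from the question:

use strict;
use warnings;
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';   # so the garbled characters print cleanly

# "a\ncell\n" in UTF-16LE: 61 00 0a 00 63 00 65 00 6c 00 6c 00 0a 00
my $bytes = encode('UTF-16LE', "a\ncell\n");

# Simulate the misplaced :crlf layer acting on the *encoded* bytes during
# a write: every 0A byte becomes 0D 0A, inserting a lone CR mid-stream.
(my $mangled = $bytes) =~ s/\x0A/\x0D\x0A/g;

print decode('UTF-16LE', $mangled);
# Prints a਍挀攀氀氀ഀ -- every code unit after the first LF is re-paired one
# byte off, producing the same CJK-looking characters as in the question.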

Fix:

open(my $fh_in,  '<:raw:encoding(UTF-16LE):crlf', $qfn_in)
    or die "$qfn_in: $!";
open(my $fh_out, '>:raw:encoding(UTF-16LE):crlf', $qfn_out)
    or die "$qfn_out: $!";
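
Applied to the script from the question, the whole program becomes something like this (a sketch: file names, separator, and the substitution are taken from the question; only the open layers changed):

#!perl

use strict;
use warnings;
use utf8;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    binary     => 1,
    sep_char   => ";",
    quote_char => undef,
    eol        => $/,   # "\n"; the :crlf layer writes it out as CRLF
});

# :raw strips the default :crlf layer; re-adding :crlf *after* :encoding
# puts CRLF translation before encoding on writes and after decoding on
# reads, which is the correct order.
open my $in,  '<:raw:encoding(UTF-16LE):crlf', 'in.csv'  or die "in.csv: $!";
open my $out, '>:raw:encoding(UTF-16LE):crlf', 'out.csv' or die "out.csv: $!";

while (my $row = $csv->getline($in)) {
    $_ =~ s/ä/æ/ for @$row;   # something will be done to the data...
    $csv->print($out, $row);
}

close $in;
close $out or die "out.csv: $!";
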
ikegami
  • For those just finding this: Use the above but first add a BOM to the filehandle, like so `print $fh chr(0xFEFF)`, before writing with the module's write method. Once I did this, Excel displayed the data properly. – sqldoug Jun 08 '18 at 18:25
  • Yes, if you want to add a BOM, you can do that (or `print $fh "\N{BOM}";`), but that has nothing to do with the question at hand. – ikegami Jun 08 '18 at 21:22
  • True, the OP doesn't mention Excel, but I find that readability in that program is a common goal when writing UTF-16LE CSV files. – sqldoug Jun 08 '18 at 22:53