Ages ago, I found some Perl online which neatly formats valid XML (adding tabs and newlines) when the whole document is on a single line. The code is below.

It uses XML::Twig to do that. It creates the XML::Twig object without keep_encoding ($twig = XML::Twig->new()), but if I give it a UTF-8 encoded XML file containing a non-ASCII character, it produces a file which is not valid UTF-8 according to the isutf8 command on Ubuntu. Opening the input and output files in xxd, I can see the character goes from two bytes to one.

If I use my $twig = XML::Twig->new(keep_encoding => 1); instead, the same input produces valid UTF-8 and the two bytes are preserved.

According to the Perldoc for keep_encoding:

This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the Expat original_string method for character, thus keeping the original encoding, as well as the original entities in the strings.

Why is a non-UTF-8 document produced without that option, and why does setting it cause the UTF-8 encoding to be preserved?

The non-ASCII character is a non-breaking space (c2 a0), by the way.

use strict;
use warnings;
use XML::Twig;
my  $sXML  = join "", (<>);
my  $params = [qw(none nsgmls nice indented record record_c)];
my  $sPrettyFormat  = $params->[3] || 'none';
my $twig = XML::Twig->new();
$twig->set_indent(" "x4);
$twig->parse( $sXML );
$twig->set_pretty_print( $sPrettyFormat );
$sXML      = $twig->sprint;
print $sXML;
mirod
matt freake
  • There are actually two things here: what XML::Twig produces and what you then save in the file. XML::Twig produces $sXML inside perl's memory but has nothing to do with you saving it in a file. – brian d foy Oct 30 '13 at 21:53
  • Thanks @briandfoy. I'll let you get back to Mastering Perl now :-) – matt freake Oct 31 '13 at 10:22

1 Answer

It's hard to test without your data, but I would guess that this is due to Perl printing the file as an ISO-8859-1 file, since it doesn't have any information about its encoding (it gets it "raw" from XML::Parser). Try binmode STDOUT, ':utf8'; before printing.
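
To illustrate the guess above, here is a minimal sketch that uses only core Perl, not XML::Twig. It assumes the parser hands back a character string containing U+00A0 (mimicked here with utf8::upgrade), and it prints to in-memory filehandles instead of STDOUT so the bytes can be inspected. The helper name bytes_written is made up for this example.

```perl
use strict;
use warnings;

# A character string holding one non-breaking space (U+00A0).
# utf8::upgrade only changes Perl's internal representation, which is an
# assumption about what a parsed string looks like; the characters are the same.
my $text = "a\x{A0}b";
utf8::upgrade($text);

# Print $text to an in-memory filehandle opened with the given layer
# and return the bytes that were written, as hex.
sub bytes_written {
    my ($layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $raw  = bytes_written('');                  # no encoding layer
my $utf8 = bytes_written(':encoding(UTF-8)');  # explicit UTF-8 layer

print "no layer:   $raw\n";
print "utf8 layer: $utf8\n";
```

Without a layer the non-breaking space should come out as the single Latin-1 byte a0, which matches what xxd showed in the question; with the UTF-8 layer it is written as c2 a0.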

Also, it may not be a great idea to read the file first and then pass a string to the parser. Using parsefile (on the file name) is safer. You potentially avoid encoding problems.
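
A sketch of how the script from the question might look with both suggestions applied, assuming the file name is passed as the first command-line argument. The sub name pretty_xml is hypothetical, and :encoding(UTF-8) is used here as the stricter relative of the ':utf8' layer mentioned above.

```perl
use strict;
use warnings;
use XML::Twig;

# Parse the named file and return the pretty-printed document as a
# Perl character string (not yet encoded for output).
sub pretty_xml {
    my ($path) = @_;
    my $twig = XML::Twig->new( pretty_print => 'indented' );
    $twig->set_indent( ' ' x 4 );
    $twig->parsefile($path);    # the parser reads the file itself
    return $twig->sprint;
}

if (@ARGV) {
    # Encode characters as UTF-8 on the way out.
    binmode STDOUT, ':encoding(UTF-8)';
    print pretty_xml( $ARGV[0] );
}
```

Letting parsefile open the file means the script never has to decide how to decode the input; the encoding only has to be stated once, on output.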

mirod
  • Thanks, that worked. Most of the time I code in Java, so I forget that Perl doesn't default to UTF-8. – matt freake Oct 31 '13 at 10:21
  • It's for backwards compatibility: if Perl had defaulted to printing in UTF-8 when it first got Unicode support, it would have broken lots of existing code. There are other ways to have it default to outputting UTF-8 though, like the `-C` option. – mirod Oct 31 '13 at 11:49