Use perl LibXML Element->getAttribute() without expanding unicode entities in value

Question

I am currently trying to create a Perl script that uses LibXML to process data in an SVG font.

In an SVG font, each character is defined as a glyph element with a unicode attribute that defines its Unicode address in the form of a Unicode entity, like so:

<glyph unicode="&#x2000;" />

Part of what I want to do is take the value of each glyph element's unicode attribute and process it like a string. However, when I use Element->getAttribute('unicode'); against a glyph node, it returns a "wide character" that displays as a placeholder rectangle, leading me to believe that it expands the Unicode entity into a Unicode character and returns that.

When I create my parser, I set expand_entities to 0, so I am not sure what else I could do to prevent this. I am rather new to XML processing, so I'm not sure I actually understand what's going on or whether this is even supposed to be preventable.

Here is a code sample:

use utf8;
use open ':std', ':encoding(UTF-8)';
use strict;
use warnings;
use XML::LibXML;
$XML::LibXML::skipXMLDeclaration = 1;

my $xmlFile = $ARGV[0];

my $parser = XML::LibXML->new();
$parser->load_ext_dtd(0);
$parser->validation(0);
$parser->no_network(1);
$parser->recover(1);
$parser->expand_entities(0);

my $xmlDom = $parser->load_xml(location => $xmlFile);

my $xmlDomSvg = XML::LibXML::XPathContext->new();
$xmlDomSvg->registerNs('svg', 'http://www.w3.org/2000/svg');

foreach my $myGlyph ($xmlDomSvg->findnodes('/svg:svg/svg:defs/svg:font/svg:glyph', $xmlDom))
{
  my $myGlyphCode = $myGlyph->getAttribute('unicode');
  print $myGlyphCode . "\n";
}

Note: If I run print $myGlyph->toString();, the Unicode entity in the output is not expanded, which is why I conclude that the expansion is happening in the getAttribute method.

Stefan Becker · Answer 1 · 2019-02-27T21:18:03.333

This might not be the answer you are looking for, but IMHO getAttribute gives you enough information, i.e. a Perl string, to solve your issue in another way. You are trying to write that Perl string to a non-UTF8 file, that's why you get the "wide character" warning.
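The distinction between a decoded character string and the bytes written to a file can be sketched with the core Encode module. This is a minimal illustration, not part of the original script; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+2000 (EN QUAD): the single character getAttribute() returns for "&#x2000;".
my $char = "\x{2000}";

# Encoding turns the one-character string into UTF-8 bytes suitable for output.
my $bytes = encode('UTF-8', $char);

printf "characters: %d\n", length($char);     # 1
printf "bytes:      %d\n", length($bytes);    # 3
printf "hex:        %s\n",
    join(' ', map { sprintf('%02X', ord($_)) } split(//, $bytes));   # E2 80 80
```

An output layer such as `:encoding(UTF-8)` performs the same encoding step implicitly on every print.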

A stripped-down example of how to get the U+xxxx value you are looking for:

use strict;
use warnings;
use open qw(:encoding(UTF-8) :std);

use XML::LibXML;

my $dom = XML::LibXML->load_xml(IO => \*DATA)
    or die "XML\n";
my $root = $dom->documentElement();
print $root->toString(), "\n";

my $attr = $root->getAttribute('unicode');
printf("'%s' is %d (U+%04X)\n", $attr, ord($attr), ord($attr));

exit 0;

__DATA__
<glyph unicode="&#x2000;" />

Test run:

$ perl dummy.pl
<glyph unicode="&#x2000;"/>
' ' is 8192 (U+2000)
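If the goal is to get the reference text back as an ordinary string, one option is to rebuild it from the decoded character. The helper below, `to_char_refs`, is a hypothetical name introduced for illustration; it formats each character's code point as a hexadecimal numeric character reference. Note that this normalizes the notation, so the original spelling (for example a decimal `&#8192;`) is not preserved.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite each character of a decoded string
# as a hexadecimal numeric character reference.
sub to_char_refs {
    my ($text) = @_;
    return join '', map { sprintf('&#x%X;', ord($_)) } split //, $text;
}

print to_char_refs("\x{2000}"), "\n";   # prints &#x2000;
print to_char_refs("fi"), "\n";         # prints &#x66;&#x69;
```

Because the result contains only ASCII characters, it can be printed or compared without any encoding concerns.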

UPDATE: The documentation for expand_entities is IMHO misleading. It talks about "entities", but it obviously means ENTITY definitions, i.e. new entities introduced in the document. The libxml2 documentation is unfortunately not much clearer. But this old message seems to indicate that the behavior you describe is expected, i.e. an XML parser should always replace pre-defined entities. The following script shows the effect of the option:

#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML;

my $parser = XML::LibXML->new({
    expand_entities => $ARGV[0] ? 1 : 0,
});

my $dom = $parser->load_xml(IO => \*DATA)
    or die "XML\n";

my $root = $dom->documentElement();
print "toString():  ", $root->toString(), "\n";
print "textContent: ", $root->textContent(), "\n";

my $attr = $root->getAttribute('test');
print "attribute:   ${attr}\n";

exit 0;

__DATA__
<?xml version="1.0"?>
<!DOCTYPE foo [
<!ENTITY author "Fluffy Bunny">
]>
<tag test="&lt;&author;&gt;">&lt;&author;&gt;</tag>

Test run:

$ perl dummy.pl 0
toString():  <tag test="&lt;&author;&gt;">&lt;&author;&gt;</tag>
textContent: <Fluffy Bunny>
attribute:   <Fluffy Bunny>

$ perl dummy.pl 1
toString():  <tag test="&lt;Fluffy Bunny&gt;">&lt;Fluffy Bunny&gt;</tag>
textContent: <Fluffy Bunny>
attribute:   <Fluffy Bunny>

Thank you. Indeed this solves my problem in this specific context, but before accepting this answer, I'll let this question sit to see if anyone has a more global solution. — Quote, Feb 27 '19 at 19:41
@Bluewoods This is the proper solution to your problem. When you have text (which is what you get from decoding the XML) you must always encode it to bytes to send it anywhere - files and pipes and sockets all take bytes. UTF-8 is the most common encoding to use for this purpose. — Grinnz, Feb 27 '19 at 20:59
@StefanBecker It's misleading to call it a "UTF-8" string. It is a decoded character string, it does not have an encoding. That Perl has to internally store it in an encoding is an implementation detail. — Grinnz, Feb 27 '19 at 21:01

score 1 · Answer 2 · answered Feb 27 '19 at 21:23

The serializeContent() method might do what you're after:

use strict;
use warnings;
use feature 'say';
use XML::LibXML;

my $xml = '<doc>
  <glyph unicode="&#x2000;" />
</doc>';

my $dom = XML::LibXML->load_xml(
    string          => $xml,
    expand_entities => 0,
    no_network      => 1,
);

my($attr) = $dom->findnodes('//glyph[1]/@unicode');

say $attr->serializeContent();

Which outputs:

&#x2000;

I suspect that the expand_entities option doesn't apply to numeric character references such as &#x2000;. The documentation is unclear and I haven't looked at the source.

In the more common case where you do want all entities expanded and just want the actual characters that those entities represent, you don't even need to call getAttribute(). Each element object supports a tied hash interface to its attributes, so you can just do this:

my $text = $glyph->{unicode};
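A short self-contained check that the hash-style access and getAttribute() agree, assuming XML::LibXML is installed; the inline one-element document is made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $dom   = XML::LibXML->load_xml(string => '<glyph unicode="&#x2000;"/>');
my $glyph = $dom->documentElement();

# Hash-style access and getAttribute() return the same decoded string.
my $via_hash   = $glyph->{unicode};
my $via_method = $glyph->getAttribute('unicode');

printf "same value: %s\n", ($via_hash eq $via_method ? 'yes' : 'no');
printf "code point: U+%04X\n", ord($via_hash);
```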
