Given the following XML snippet:

<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>

How do I get this output?

outline
node1=text1
node1 attribute1=value1
node1 attribute2=value2

I have looked into XML::LibXML::Reader, but that module appears to provide access to attribute values only when they are referenced by name. How do I get the list of attribute names in the first place?

Deduplicator

2 Answers

Something like this should help you.

It's not clear from your question whether <outline> is the root element of the data, or if it is buried somewhere in a bigger document. It's also unclear how general you want the solution to be - e.g. do you want the entire document dumped in this manner?

Anyway, this program generates the output you requested from the given XML input in a fairly concise manner.

use strict;
use warnings;
use 5.014;     # For the /r (non-destructive substitution) modifier

use XML::LibXML;

my $xml = XML::LibXML->load_xml(IO => \*DATA);

my ($node) = $xml->findnodes('//outline');

print $node->nodeName, "\n";

for my $child ($node->getChildrenByTagName('*')) {
  my $name = $child->nodeName;

  printf "%s=%s\n", $name, $child->textContent =~ s/\A\s+|\s+\z//gr;

  for my $attr ($child->attributes) {
    printf "%s %s=%s\n", $name, $attr->getName, $attr->getValue;
  }
}

__DATA__
<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>

output

outline
node1=text1
node1 attribute1=value1
node1 attribute2=value2
Borodin

You can get the list of an element's attribute nodes with $e->findnodes( "./@*").

Below is a solution using plain XML::LibXML, not XML::LibXML::Reader, that works with your test data. It may be sensitive to extra whitespace and mixed content, though, so test it on real data before using it.

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;

my $dom= XML::LibXML->load_xml( IO => \*DATA);
my $elements= $dom->findnodes( "//*");   # every element in the document

foreach my $e (@$elements)
  { print $e->nodeName;

    # text needs to be trimmed or line returns show up in the output
    my $text= $e->textContent;
    $text=~s{^\s*}{};
    $text=~s{\s*$}{};

    # length() so that a text value of "0" is still printed
    if( ! $e->getChildrenByTagName( '*') && length $text)
      { print "=$text"; }
    print "\n"; 

    my @attrs= $e->findnodes( "./@*");
    # or, as suggested by Borodin below, $e->attributes

    foreach my $attr (@attrs)
      { print $e->nodeName, " ", $attr->nodeName, "=", $attr->value, "\n"; }
  }
__END__
<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>
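
If you would rather stay with XML::LibXML::Reader, as in the question, the reader can also step through an element's attributes without knowing their names in advance. The following is only a sketch under that assumption: `attribute_lines` is a hypothetical helper name, and it prints just the attribute lines, not the element text.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: returns one "element attribute=value" line per attribute.
sub attribute_lines {
    my ($xml) = @_;
    my $reader = XML::LibXML::Reader->new(string => $xml);
    my @lines;

    while ($reader->read) {
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
        my $element = $reader->name;

        # Both move* methods return 1 while there is an attribute to visit.
        my $has_attr = $reader->moveToFirstAttribute;
        while ($has_attr == 1) {
            push @lines, sprintf '%s %s=%s', $element, $reader->name, $reader->value;
            $has_attr = $reader->moveToNextAttribute;
        }
        $reader->moveToElement;    # step back from the attribute to its element
    }
    return @lines;
}

print "$_\n" for attribute_lines(<<'XML');
<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>
XML
```

The DOM-based code above is simpler for small documents; the reader is mostly worth it when the input is too large to load in one go.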
mirod
  • There are much cleaner ways to fetch the attributes. The obvious is `my @attrs = $e->attributes`, which returns a list of all attribute nodes, but an element node object also behaves as a tied hash reference, and `keys %$e` will return all of the attribute names while `$e->{attr_name}` will return the value of attribute `attr_name`. – Borodin Nov 07 '14 at 09:43
  • thanks, I didn't find this in the docs, which I thought was strange. And now I see it, under "Overloading", duh! I still don't see `attributes` though, at least in the docs for `XML::LibXML::Element` – mirod Nov 07 '14 at 11:03
  • I see, I wasn't expecting to find it there. Actually it makes no sense at all. I see that it is also used to return the list of namespace declarations associated with the node, WTF? Why 1 method for 2 extremely different results? I can't even find it in the DOM spec... Boy I'm glad I use XML::Twig ;--) – mirod Nov 07 '14 at 11:21
  • The border between `XML::LibXML::Element` and `XML::LibXML::Node` is a little strange. I would expect all attribute stuff to appear in the former as no other node type can have attributes. But the namespace declarations is kinda okay: a namespace looks just like an attribute called `xmlns`. – Borodin Nov 07 '14 at 11:32
  • agreed, indeed with `findnodes( "./@*")` (or using `%$e`) you don't get the namespace declarations, while `attributes` gives them to you. And before testing, I thought that `attributes` would return a list of all namespace declarations that applied to a node, not just the ones declared in the start tag of the element. – mirod Nov 07 '14 at 12:30
  • It has been on my list of things to do -- towards the bottom, in the section marked "interesting" -- to examine and understand the [libxml2 library](http://xmlsoft.org/) on which this is based: exercises like that always enhance my understanding of related software. I hope to find that strangenesses like this one in the Perl glue library are mainly due to our vision being forced through the fat lenses of the author's spectacles. – Borodin Nov 07 '14 at 16:57
  • Thank you very much! I like both solutions: Borodin's for the use of `attributes` and mirod's for its unified approach to node walking with `findnodes( "//*")`. (Sorry, my question was badly composed: the `<outline>` is basically an ordinary node, just like `<node1>`, so what I really needed was a recursive walk over the whole document.) You've done a good job at clarifying the Perl docs too ;) – Alexander Shcheblikin Nov 08 '14 at 00:28
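
Pulling the comments together, the recursive walk mentioned in the last comment might look something like the sketch below. It uses the hash-reference behaviour Borodin describes (`keys %$e` and `$e->{attr_name}`), which, per mirod's test, leaves out namespace declarations. `dump_element` is a hypothetical helper name, and the `sort` is only an assumption made here to keep the output order predictable.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns the output lines for $element and its descendants.
sub dump_element {
    my ($element) = @_;
    my $name      = $element->nodeName;
    my @children  = $element->getChildrenByTagName('*');

    # Only leaf elements report their (trimmed) text content.
    my $text = $element->textContent;
    $text =~ s/\A\s+|\s+\z//g;
    my @lines = ( !@children && length($text) ? "$name=$text" : $name );

    # The element behaves as a hash reference keyed by attribute name.
    for my $attr_name (sort keys %$element) {
        push @lines, "$name $attr_name=$element->{$attr_name}";
    }

    # Recurse into the child elements in document order.
    push @lines, map { dump_element($_) } @children;
    return @lines;
}

my $dom = XML::LibXML->load_xml(string => <<'XML');
<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>
XML

print "$_\n" for dump_element($dom->documentElement);
```

Starting from `$dom->documentElement` means no element name has to be hard-coded, so `<outline>` is treated like any other node.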