
I am trying to parse XML using the XML::LibXML module. The XML has a node called <row> that encloses two child nodes, <key> and <value>. I want to parse each <row> and build a hash data structure. I came up with the code below, but I feel there must be a better way to do it.

use strict;
use warnings;

use Data::Dumper;
use XML::LibXML;

my $XML=<<EOF;
<config>
    <row>
        <key>
            <A1>alpha</A1>
            <A2>beta</A2>
            <A3>cat</A3>
            <A4>delta</A4>
        </key>
        <value>
            <B1>eclipse</B1>
            <B2>pico</B2>
            <B3>penta</B3>
            <B4>zeta</B4>
        </value>
    </row>
    <row>
        <key>
            <A1>tom</A1>
            <A2>harry</A2>
            <A3>bob</A3>
            <A4>ben</A4>
        </key>
        <value>
            <B1>TAP</B1>
            <B2>MAN</B2>
            <B3>WORK</B3>
            <B4>MAINTAIN</B4>
        </value>
    </row>
</config>
EOF

my $parser = XML::LibXML->new();
my $doc  = $parser->parse_string($XML);

my %hash;
my $i = 1;

foreach my $node ($doc->findnodes('/config/row/key')) {
    foreach my $tag ('A1', 'A2','A3','A4') {
        $hash{'KEY' . $i}{$tag} = $node->findvalue( $tag );
    }
    $i++;
}

$i = 1;

foreach my $node ($doc->findnodes('/config/row/value')) {
    foreach my $tag ('B1', 'B2','B3','B4') {
        $hash{'KEY' . $i}{$tag} = $node->findvalue( $tag );
    }
    $i++;
}

print Dumper \%hash;

Output

$VAR1 = {
          'KEY2' => {
                      'A3' => 'bob',
                      'B3' => 'WORK',
                      'B1' => 'TAP',
                      'A1' => 'tom',
                      'B4' => 'MAINTAIN',
                      'B2' => 'MAN',
                      'A2' => 'harry',
                      'A4' => 'ben'
                    },
          'KEY1' => {
                      'A3' => 'cat',
                      'B3' => 'penta',
                      'B1' => 'eclipse',
                      'A1' => 'alpha',
                      'B4' => 'zeta',
                      'B2' => 'pico',
                      'A2' => 'beta',
                      'A4' => 'delta'
                    }
        };

Actually, instead of creating imaginary keys (KEY1, KEY2, ...), I would like the value of the <A1> node to be used as the key for each section. Can someone please help me out here?

Desired output:

'tom'   => {
             'A3' => 'bob',
             'B3' => 'WORK',
             'B1' => 'TAP',

             'B4' => 'MAINTAIN',
             'B2' => 'MAN',
             'A2' => 'harry',
             'A4' => 'ben'
           },
'alpha' => {
             'A3' => 'cat',
             'B3' => 'penta',
             'B1' => 'eclipse',

             'B4' => 'zeta',
             'B2' => 'pico',
             'A2' => 'beta',
             'A4' => 'delta'
           }
Ken Y-N
chidori

2 Answers


"I would like to have <A1> node's value to be considered as key for each section"

This solution creates a hash for each row element and pushes it onto the @rows array. Unlike the original it reads the XML data from a file called config.xml

The tags for the A* and B* elements are ignored -- it is simply assumed that the keys and values are in the same order

The main loop iterates over the row elements, and for each row, a list of the key and value child elements is converted to their text values with a map. Then a hash is built and pushed onto the array
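The positional pairing described above can be sketched without any XML at all. This is only an illustration: the two arrays are hypothetical stand-ins for the text of the key/* and value/* children of one row, and the length check is an extra guard that is not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the text of key/* and value/*
my @keys   = qw( alpha beta cat delta );
my @values = qw( eclipse pico penta zeta );

# The slice pairs elements purely by position, so guard against a mismatch
die "key/value count mismatch\n" unless @keys == @values;

my %row;
@row{@keys} = @values;    # hash slice: $row{alpha} = 'eclipse', and so on

print "$_ => $row{$_}\n" for sort keys %row;
```

If the two lists could ever differ in length, the slice would silently leave some keys undefined or drop surplus values, which is why the guard may be worth having.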

I've used Data::Dump to display the resulting data structure as I believe it is far superior to Data::Dumper

use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml( location => 'config.xml' );

my @rows;

for my $row ($doc->findnodes('/config/row')) {

    my @keys   = map $_->textContent, $row->findnodes('key/*');
    my @values = map $_->textContent, $row->findnodes('value/*');

    my %row;
    @row{@keys} = @values;
    push @rows, \%row;
}

use Data::Dump;
dd \@rows;

output

[
  { alpha => "eclipse", beta => "pico", cat => "penta", delta => "zeta" },
  { ben => "MAINTAIN", bob => "WORK", harry => "MAN", tom => "TAP" },
]
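If Data::Dump is not installed, the core Data::Dumper module can be nudged towards more compact, stable output with a few of its configuration variables. The sketch below assumes a small hand-written hash in place of the parsed XML; it is not part of the answer above.

```perl
use strict;
use warnings;

use Data::Dumper;

# Package-level settings; they affect every later call to Dumper
$Data::Dumper::Sortkeys = 1;    # emit hash keys in sorted order
$Data::Dumper::Indent   = 1;    # fixed, shallow indentation
$Data::Dumper::Terse    = 1;    # omit the leading "$VAR1 = "

# Hypothetical sample data, not read from config.xml
my %hash = (
    tom   => { A2 => 'harry', B1 => 'TAP' },
    alpha => { A2 => 'beta',  B1 => 'eclipse' },
);

my $dump = Dumper( \%hash );
print $dump;
```

With Sortkeys set, repeated runs list the keys in the same order, which makes the dumps easier to compare.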

Update

Here's a variation that complies with your desired output. Thanks to choroba for pointing it out to me

It's a very similar approach to my original one above, but it builds a hash instead of an array and uses the elements' tag names as keys instead of the key/value relationship that I guessed you would want

I should say that I'm very doubtful about your choice of data structure; for instance, I see no need to exclude the A1 key from the subsidiary hash just because its value is used to identify the row. I would also be surprised if it wouldn't be better to use the key and value strings as keys and values. But it may also be that the XML tag names are badly chosen and your choice is optimal, and I have no way of knowing

Here's the Perl code, which reads from the config.xml file as before. If you would prefer to keep the A1 hash element as I described, then set $section from the first item and store every item unconditionally, instead of skipping the first item in the else branch

use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml( location => 'config.xml' );

my %data;

for my $row ( $doc->findnodes('/config/row') ) {

    # Reset for every row so that each row's first item (A1) names a new section
    my $section;

    for my $item ( $row->findnodes('key/* | value/*') ) {

        my ($key, $val) = ( $item->tagName, $item->textContent );

        if ( defined $section ) {
            $data{$section}{$key} = $val;
        }
        else {
            $section = $val;
        }
    }
}

use Data::Dump;
dd \%data;

output

{
  alpha => {
    A2 => "beta",
    A3 => "cat",
    A4 => "delta",
    B1 => "eclipse",
    B2 => "pico",
    B3 => "penta",
    B4 => "zeta",
  },
  tom => {
    A2 => "harry",
    A3 => "bob",
    A4 => "ben",
    B1 => "TAP",
    B2 => "MAN",
    B3 => "WORK",
    B4 => "MAINTAIN",
  },
}
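For the variation that keeps the A1 element inside each inner hash, a minimal sketch might look like the following. It is an assumption-laden illustration rather than the answer's own code: it uses a shortened inline XML string instead of config.xml, and it looks up the section name explicitly with findvalue('key/A1') instead of relying on A1 being the first node returned.

```perl
use strict;
use warnings;

use XML::LibXML;

# Shortened, hypothetical sample in place of config.xml
my $xml = <<'END_XML';
<config>
  <row>
    <key><A1>alpha</A1><A2>beta</A2></key>
    <value><B1>eclipse</B1><B2>pico</B2></value>
  </row>
  <row>
    <key><A1>tom</A1><A2>harry</A2></key>
    <value><B1>TAP</B1><B2>MAN</B2></value>
  </row>
</config>
END_XML

my $doc = XML::LibXML->load_xml( string => $xml );

my %data;

for my $row ( $doc->findnodes('/config/row') ) {

    # The text of this row's A1 element names the section
    my $section = $row->findvalue('key/A1');

    # Store every child of key and value, A1 included
    for my $item ( $row->findnodes('key/* | value/*') ) {
        $data{$section}{ $item->nodeName } = $item->textContent;
    }
}

print join( ', ', map { "$_=$data{tom}{$_}" } sort keys %{ $data{tom} } ), "\n";
```

Looking up A1 by name costs one extra XPath query per row, but it keeps the result independent of the order in which the child elements appear.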
Community
Borodin

The first XPath expression selects the A1s, the second one selects all the A* and B* in the same row (except the A1 itself).

#! /usr/bin/perl
use warnings;
use strict;

use XML::LibXML;

my $xmlstring = << '__XML__';
<config>
    ...
</config>
__XML__

my $xml = 'XML::LibXML'->load_xml(string => $xmlstring);
my $root = $xml->documentElement;

my %hash;
for my $a1 ($root->findnodes('/config/row/key/A1')) {
    for my $node ($a1->findnodes('(../../key/*[not(self::A1)] | ../../value/*)')) {
        $hash{ $a1->textContent }{ $node->getName } = $node->textContent;
    }
}

use Data::Dump;
dd \%hash;

output

{
  alpha => {
    A2 => "beta",
    A3 => "cat",
    A4 => "delta",
    B1 => "eclipse",
    B2 => "pico",
    B3 => "penta",
    B4 => "zeta",
  },
  tom => {
    A2 => "harry",
    A3 => "bob",
    A4 => "ben",
    B1 => "TAP",
    B2 => "MAN",
    B3 => "WORK",
    B4 => "MAINTAIN",
  },
}
Borodin
choroba
  • Why the quotes in `'XML::LibXML'->load_xml`? I'm sure you know that they're optional in a class method call. I also think you need some narrative – Borodin Aug 03 '15 at 21:39
  • @Borodin: regarding the quotes, see http://stackoverflow.com/a/16656174/1030675. They are nicer than `XML::LibXML::`, but a bit less powerful. – choroba Aug 03 '15 at 21:44
  • @Borodin: The code was tested, please don't change it in a way that changes its output. – choroba Aug 03 '15 at 21:46
  • @Borodin: Here's where I learned to quote class names: http://www.perlmonks.org/?node_id=980498 – choroba Aug 03 '15 at 22:04
  • I apologise for my edit, but after all your original solution has no output. I've added just a dump and the corresponding output. I hope you can see what I meant by my original changes? – Borodin Aug 03 '15 at 22:58
  • @Borodin: I don't. What I see is `is_deeply` same as the expected output in the question. – choroba Aug 03 '15 at 23:03
  • The OP has only his *actual* output, and it includes `A1` as a key. Your code explicitly excludes it – Borodin Aug 03 '15 at 23:04
  • @Borodin: So what's that "**Desired output:**"? – choroba Aug 03 '15 at 23:10
  • Ah I see my misunderstanding. I overlooked `**Desired output**` because of the bad markdown. Your solution satisfies that admirably – Borodin Aug 03 '15 at 23:12
  • *"Here's where I learned to quote class names"* I think I prefer to keep to the rule that lexical identifiers should never contain upper-case letters. That fixes a few other things as well and makes the code more readable – Borodin Aug 05 '15 at 07:00
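The reason for quoting class names that is discussed in the comments above can be shown with a small core-only sketch. The package and sub names here (Widget, Impostor) are hypothetical and chosen only for illustration: when a sub with the same name as the class exists in the current package, Perl parses the bareword form as a call to that sub.

```perl
use strict;
use warnings;

package Widget;
sub name { return 'Widget class' }      # class method; ignores its invocant

package Impostor;
sub name { return 'Impostor class' }

package main;

# A sub in the current package that happens to share the class's name
sub Widget { return 'Impostor' }

my $bare   = Widget->name;      # parsed as Widget()->name
my $quoted = 'Widget'->name;    # string invocant: always the Widget package
my $colons = Widget::->name;    # trailing :: also forces a package name

print "bareword: $bare\n";
print "quoted:   $quoted\n";
print "colons:   $colons\n";
```

Both the quoted and the trailing-colon forms avoid the ambiguity; which one reads better is the matter of taste the comments debate.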