perl XML::Simple for repeated elements

Question

I have the following XML:

<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Aug 26, 2013 10:02:03 +0900 (GMT+09:00) -->
<pathway name="path:ko01200" >
    <reaction id="14" name="rn:R01845" type="irreversible">
        <substrate id="108" name="cpd:C00447"/>
        <product id="109" name="cpd:C05382"/>
     </reaction>
    <reaction id="15" name="rn:R01641" type="reversible">
        <substrate id="109" name="cpd:C05382"/>
        <substrate id="104" name="cpd:C00118"/>
        <product id="110" name="cpd:C00117"/>
        <product id="112" name="cpd:C00231"/>
     </reaction>
</pathway>

I am trying to print the substrate and product ids with the following code, but I am stuck on the reactions that have more than one id. I tried using Dumper to inspect the data structure, but I don't know how to proceed. I have already used XML::Simple for the rest of my parsing script (this is a small part of the whole script), so I cannot change that now.

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

my $xml  = XML::Simple->new;
my $data = $xml->XMLin("test.xml", KeyAttr => ['id']);
print Dumper($data);

foreach my $reaction ( sort keys %{ $data->{reaction} } ) {
    print $data->{reaction}->{$reaction}->{substrate}->{id} . "\n";
    print $data->{reaction}->{$reaction}->{product}->{id} . "\n";
}

Here is the output

$VAR1 = {
      'name' => 'path:ko01200',
      'reaction' => {
                    '15' => {
                            'substrate' => {
                                           '104' => {
                                                    'name' => 'cpd:C00118'
                                                  },
                                           '109' => {
                                                    'name' => 'cpd:C05382'
                                                  }
                                         },
                            'name' => 'rn:R01641',
                            'type' => 'reversible',
                            'product' => {
                                         '112' => {
                                                  'name' => 'cpd:C00231'
                                                },
                                         '110' => {
                                                  'name' => 'cpd:C00117'
                                                }
                                       }
                          },
                    '14' => {
                            'substrate' => {
                                           'name' => 'cpd:C00447',
                                           'id' => '108'
                                         },
                            'name' => 'rn:R01845',
                            'type' => 'irreversible',
                            'product' => {
                                         'name' => 'cpd:C05382',
                                         'id' => '109'
                                       }
                          }
                  }
    };
108
109
Use of uninitialized value in concatenation (.) or string at  line 12.
Use of uninitialized value in concatenation (.) or string at line 13.

My rule for XML::Simple is that the first time you have a question on how to use it, stop using it and move onto a better XML system. :) — brian d foy, Oct 01 '13 at 14:02
@briandfoy I wish I had known that before. I actually got the idea of using XML::Simple here on Stack Overflow; people encouraged me to use it. — user1876128, Oct 01 '13 at 14:07

score 3 · Accepted Answer · answered Oct 01 '13 at 08:22

First of all, don't use XML::Simple. It is hard to predict exactly what data structure it will produce from a bit of XML, and its own documentation discourages its use in new code.

Anyway, your problem is that you want to access an id field in the product and substrate subhashes, but that field doesn't exist in one of the reaction subhashes:

'15' => {
    'substrate' => {
         '104' => {
             'name' => 'cpd:C00118'
         },
         '109' => {
             'name' => 'cpd:C05382'
         }
     },
     'name' => 'rn:R01641',
     'type' => 'reversible',
     'product' => {
         '112' => {
             'name' => 'cpd:C00231'
         },
         '110' => {
             'name' => 'cpd:C00117'
         }
     }
 },

Instead, the keys are the ids themselves, and each value is a hash containing only a name. The other reaction has a totally different structure, so special-case code would have to be written for both. This is why XML::Simple shouldn't be used: the output is just too unpredictable.
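To make the special-casing concrete, here is a rough sketch that walks both shapes. The data is hard-coded from the Dumper output in the question (trimmed to the relevant keys), and the helper name ids_of is made up for this illustration; it is not part of XML::Simple.

```perl
use strict;
use warnings;

# Data copied from the Dumper output in the question, trimmed to the relevant keys.
my $data = {
    reaction => {
        14 => {
            substrate => { id => '108', name => 'cpd:C00447' },
            product   => { id => '109', name => 'cpd:C05382' },
        },
        15 => {
            substrate => {
                104 => { name => 'cpd:C00118' },
                109 => { name => 'cpd:C05382' },
            },
            product => {
                110 => { name => 'cpd:C00117' },
                112 => { name => 'cpd:C00231' },
            },
        },
    },
};

# Hypothetical helper: return the ids from either shape.
sub ids_of {
    my ($node) = @_;
    return () unless ref($node) eq 'HASH';

    # Single element: 'id' is an ordinary key holding a plain string.
    return ( $node->{id} ) if exists $node->{id} && !ref $node->{id};

    # Repeated elements: folded on 'id', so the ids are the hash keys.
    return sort { $a <=> $b } keys %$node;
}

my @ids;
for my $reaction_id ( sort { $a <=> $b } keys %{ $data->{reaction} } ) {
    my $reaction = $data->{reaction}{$reaction_id};
    push @ids, ids_of( $reaction->{substrate} ), ids_of( $reaction->{product} );
}

print "$_\n" for @ids;
```

This prints 108, 109, 104, 109, 110, 112, one per line. Note that the folded hashes have lost the document order of the repeated elements (109 came before 104 in the XML), which is one more reason the structure is awkward to work with.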

Enter XML::LibXML. It is nothing extraordinary, but it implements standard APIs such as the DOM and XPath for traversing your XML document.

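use strict;
use warnings;
use XML::LibXML;
use feature 'say'; # requires Perl 5.10 or later

# load_xml takes the file name via "location" and throws an exception on failure
my $doc = XML::LibXML->load_xml(location => "test.xml");

# select every <product> and <substrate> that is a child of a <reaction>
for my $reaction_item ($doc->findnodes('//reaction/product | //reaction/substrate')) {
  say $reaction_item->getAttribute('id');
}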

Output:

108
109
109
104
110
112

Thanks for your answer , but I have already used XML simple for the rest of my parsing script (this part is a small part of my whole script ) and I can not change that now — user1876128, Oct 01 '13 at 08:50
@user1876128 The XML::Simple documentation lists various options that change how the resulting data structure is created. You may be able to find some combination that creates a uniform structure. I do not have that expertise, and prefer traversing XML using XPath; notice how short my final code is. In the long run, you won't regret leaving XML::Simple behind. — amon, Oct 01 '13 at 09:20
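
For anyone who has to stay with XML::Simple, here is one combination of options that might give a uniform structure. This is a sketch, not a tested recommendation from the answer above: it assumes that only reaction should be folded on its id, and that substrate and product should always be arrays. The XML is inlined (without the DOCTYPE line) so the snippet is self-contained.

```perl
use strict;
use warnings;
use XML::Simple;

# The XML from the question, inlined for illustration (DOCTYPE omitted).
my $xml_text = <<'XML';
<pathway name="path:ko01200">
    <reaction id="14" name="rn:R01845" type="irreversible">
        <substrate id="108" name="cpd:C00447"/>
        <product id="109" name="cpd:C05382"/>
    </reaction>
    <reaction id="15" name="rn:R01641" type="reversible">
        <substrate id="109" name="cpd:C05382"/>
        <substrate id="104" name="cpd:C00118"/>
        <product id="110" name="cpd:C00117"/>
        <product id="112" name="cpd:C00231"/>
    </reaction>
</pathway>
XML

my $data = XMLin(
    $xml_text,
    # Fold only <reaction> elements into a hash keyed on their id.
    KeyAttr    => { reaction => 'id' },
    # Always represent these elements as arrays, even when there is only one.
    ForceArray => [ 'reaction', 'substrate', 'product' ],
);

my @ids;
for my $reaction_id ( sort { $a <=> $b } keys %{ $data->{reaction} } ) {
    my $reaction = $data->{reaction}{$reaction_id};

    # Every substrate and product is now a hash with 'id' and 'name' keys.
    for my $item ( @{ $reaction->{substrate} || [] }, @{ $reaction->{product} || [] } ) {
        push @ids, $item->{id};
    }
}

print "$_\n" for @ids;
```

With these options every reaction should have the same shape, so the loop needs no special cases, and the arrays keep the repeated elements in document order.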
