Generate Two Separate XML Files by Walking DOM Tree and Using Attributes

Question

I need to split XML into two separate files based on optional attribute values. I would prefer to use the XML::LibXML DOM methods in Perl.

Sample XML code excerpt:
...
<LocalJMS>
  <Name>ZLAT</Name>
  <PrimaryConnection>
    <Address string="ops">ops.zla</Address>
    <Address string="spt">spt.zla</Address>
    <Port>77777</Port>
  </PrimaryConnection>
  <SecondaryConnection string="ops">
    <Address>abc.zla</Address>
    <Port>77777</Port>
  </SecondaryConnection>
</LocalJMS>

The two desired output XML files would be:

1.) OPS file:
...
 <LocalJMS>
   <Name>ZLAT</Name>
   <PrimaryConnection>
     <Address>ops.zla</Address>
     <Port>77777</Port>
   </PrimaryConnection>
   <SecondaryConnection>
     <Address>abc.zla</Address>
     <Port>77777</Port>
   </SecondaryConnection>
 </LocalJMS>

2.) SPT file:
...
 <LocalJMS>
   <Name>ZLAT</Name>
   <PrimaryConnection>
     <Address>spt.zla</Address>
     <Port>77777</Port>
   </PrimaryConnection>
 </LocalJMS>

I have no problem removing the attributes before generating the two final XML files, nor with deciding what to do with an attributed element that has no child elements: as I walk the DOM tree and check the child nodes, I can send that XML content to the correct final file.

But the problem I'm encountering is when the attribute is defined on a child element (e.g. 'SecondaryConnection', which is a child of 'LocalJMS'). If I "walk" the DOM tree, I first encounter the parent element 'LocalJMS', and I need some of its child elements (e.g. 'Name', 'PrimaryConnection') to go to both final files, but I need the 'SecondaryConnection' element to go only to the OPS XML file (not the SPT file). [By the way, the attribute applies to all of that element's child nodes, i.e. 'Address' and 'Port'.]

I'm looking for some ideas: maybe using parse_balanced_chunk, or starting from the deepest part of the original XML file and working outwards, cycling through each child node. I would hate to have to use traditional grep patterns and treat the XML file like a simple text file; I was hoping to take advantage of the DOM methods.

So what is the problem with yet another attribute? If you're OK with `Name` and `PrimaryConn..` what is wrong with `SecondaryConn..`? (And, by what criteria do you decide which goes where?) — zdim, Apr 24 '18 at 03:08
zdim - thank you for your attention... the criteria is the attribute values. When I walk the DOM, the top parent node contains a child node with an attribute - an attribute that is needed as criteria, but since the parent node is processed first (which contains a child node with an attribute), I end up processing the parent node without processing the child node yet. I won't be able to process the child node with attribute until I've already processed the parent node, which then will be too late. — CraigP, Apr 24 '18 at 13:32

score 0 · Answer 1 · answered Apr 24 '18 at 13:37

I suggest that you parse the original XML and then, for each value of the string attribute, clone the whole document and remove every element whose string attribute has a different value.

It would look like this. I am sure you can adapt the output to something more appropriate if necessary.

use strict;
use warnings 'all';

use XML::LibXML;

my $dom = XML::LibXML->load_xml( location => 'sample.xml' );

for my $string ( qw/ ops spt / ) {

    print "\$string = $string\n\n";

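    # Deep-copy the document so each pass starts from the unmodified original
    my $copy = $dom->cloneNode(1);

    # Remove every element whose string attribute has a different value
    for my $unwanted ( $copy->findnodes("//*[\@string != '$string']") ) {
        my $parent = $unwanted->parentNode;
        $parent->removeChild($unwanted);
    }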

    print $copy, "\n\n---\n\n";
}

output

$string = ops

<?xml version="1.0"?>
<LocalJMS>
  <Name>ZLAT</Name>
  <PrimaryConnection>
    <Address string="ops">ops.zla</Address>

    <Port>77777</Port>
  </PrimaryConnection>
  <SecondaryConnection string="ops">
    <Address>abc.zla</Address>
    <Port>77777</Port>
  </SecondaryConnection>
</LocalJMS>


---

$string = spt

<?xml version="1.0"?>
<LocalJMS>
  <Name>ZLAT</Name>
  <PrimaryConnection>

    <Address string="spt">spt.zla</Address>
    <Port>77777</Port>
  </PrimaryConnection>

</LocalJMS>


---

[Finished in 0.8s]
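
The output above still carries the string attributes and is printed rather than written to files. What follows is a minimal sketch of one way to finish the job, not a definitive implementation. It assumes the sample document from the question (inlined here so the script runs on its own). The helper name variant_for and the output file names OPS.xml and SPT.xml are hypothetical choices made for illustration. The no_blanks parser option is used so that removed elements do not leave empty lines behind when the copy is re-indented.

```perl
use strict;
use warnings;

use XML::LibXML;

# Inline copy of the sample from the question, so the sketch is self-contained.
my $xml = <<'XML';
<LocalJMS>
  <Name>ZLAT</Name>
  <PrimaryConnection>
    <Address string="ops">ops.zla</Address>
    <Address string="spt">spt.zla</Address>
    <Port>77777</Port>
  </PrimaryConnection>
  <SecondaryConnection string="ops">
    <Address>abc.zla</Address>
    <Port>77777</Port>
  </SecondaryConnection>
</LocalJMS>
XML

# Hypothetical helper: return a pruned deep copy of $dom for one attribute value.
sub variant_for {
    my ($dom, $value) = @_;

    my $copy = $dom->cloneNode(1);    # deep copy; the original stays intact

    # Drop every element tagged for a different value, subtree and all.
    for my $unwanted ( $copy->findnodes(qq{//*[\@string != "$value"]}) ) {
        $unwanted->parentNode->removeChild($unwanted);
    }

    # The surviving elements no longer need the marker attribute.
    $_->removeAttribute('string') for $copy->findnodes('//*[@string]');

    return $copy;
}

# no_blanks discards whitespace-only text nodes while parsing.
my $dom = XML::LibXML->load_xml( string => $xml, no_blanks => 1 );

for my $value (qw/ ops spt /) {
    my $copy = variant_for( $dom, $value );
    $copy->toFile( uc($value) . '.xml', 1 );    # second argument 1 = indent output
}
```

Because the XPath expression selects the unwanted element itself, its whole subtree goes with it, so the order in which a walk would visit parent and child no longer matters. For real input, load_xml( location => 'sample.xml', no_blanks => 1 ) would replace the inline string.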

Kjetil S. · Answer 2 · 2018-04-24T13:55:27.533

-2

It's not XML parsing as you wanted, but it seems to work for your sample data. A tag with a string="filename" attribute goes into only that file (where the file name is the uppercased value), with the string attribute removed. All other tags go into every file:

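use strict;
use warnings;

my $input = join "", <DATA>;

# Collect each distinct value of the string attribute (ops, spt, ...)
my %seen;
my @string = grep { !$seen{$_}++ } $input =~ / string="(\w+)"/g;

for my $s (@string) {
    my $output = $input;
    # Keep a tagged element (minus its attribute) only when its value matches $s
    $output =~
      s{ (\s*) <(\w+)\s* ([^>]*?) string="(\w+)" (.*?</\2>) }
       { $4 eq $s ? "$1<$2$3$5" : ""                        }gsex;
    # The output file is named after the uppercased value, e.g. OPS
    open my $FH, '>', uc($s) or die "Cannot open output file: $!";
    print $FH $output;
    close($FH);
}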
__DATA__
<LocalJMS>
  <Name>ZLAT</Name>
  <PrimaryConnection>
    <Address string="ops">ops.zla</Address>
    <Address string="spt">spt.zla</Address>
    <Port>77777</Port>
  </PrimaryConnection>
  <SecondaryConnection string="ops">
    <Address>abc.zla</Address>
    <Port>77777</Port>
  </SecondaryConnection>
</LocalJMS>

Output:

$ cat OPS 
<LocalJMS>
  <Name>ZLAT</Name>
  <PrimaryConnection>
    <Address>ops.zla</Address>
    <Port>77777</Port>
  </PrimaryConnection>
  <SecondaryConnection>
    <Address>abc.zla</Address>
    <Port>77777</Port>
  </SecondaryConnection>
</LocalJMS>

$ cat SPT
<LocalJMS>
  <Name>ZLAT</Name>
  <PrimaryConnection>
    <Address>spt.zla</Address>
    <Port>77777</Port>
  </PrimaryConnection>
</LocalJMS>

edited Apr 24 '18 at 13:55

answered Apr 24 '18 at 10:49

Kjetil S. - I thank you for the response ... although I understand exactly what you are suggesting, I'm a little hung up on your line: s{ (\s*) <(\w+)\s* ([^>]*?) string="(\w+)" (.*?</\2>) } { $4 eq $s ? "$1<$2$3$5" : "" }grsex; I can't seem to use the "r" modifier (i.e., I can only use "gsex"). But with that said, when I attempt to debug this statement by printing the first capture group $1, I'm getting an empty value. – CraigP Apr 24 '18 at 13:20
@CraigP Aha, the `/r` modifier needs Perl version >= 5.14 (from 2011). I have rewritten my answer to not use `/r` now. The `$1` can be empty or it can contain `\s` chars, that is spaces, tabs, newlines and such. – Kjetil S. Apr 24 '18 at 13:57
The /r non-destructive option is not available until v5.14, I'm running v5.10.1 – CraigP Apr 24 '18 at 14:05
I had never used the "e" ('evaluate the right-hand side as an expression') modifier - pretty cool ... you've given me something I can hopefully work from ... ty! – CraigP Apr 24 '18 at 15:19
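
For readers on a Perl older than 5.14, as in the comments above, the usual stand-in for the /r modifier is to copy the string first and substitute in the copy. The snippet below is only an illustrative sketch of that idiom; the one-line input and the simplified pattern are assumptions made for the example, not the regex from the answer.

```perl
use strict;
use warnings;

my $input = '<Address string="ops">ops.zla</Address>';

# Works on any Perl 5: assign the copy, then run s/// on that copy.
(my $output = $input) =~ s/ string="\w+"//g;

# Perl 5.14 and later only: /r returns the modified copy instead.
# my $output = $input =~ s/ string="\w+"//gr;

print "$input\n";     # the original string is untouched
print "$output\n";    # the copy has the attribute removed
```

The parentheses around the assignment matter: without them the substitution would bind to $input and modify the original.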
