This may be a basic question, but it's killing me.

The following is my code snippet:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;


my $twig = XML::Twig->new( twig_handlers => { TRADE => \&TRADE } );

$twig->parsefile('1510.xml');

$twig->set_pretty_print('indented');

$twig->print_to_file('out.xml');

sub TRADE {
    my ( $twig, $TRADE ) = @_;
    # cut every TRADE whose origin is not COMPUTER (cut takes no argument)
    $TRADE->cut unless
        $TRADE->att('origin') eq "COMPUTER";
}

This works as expected: the output contains only the TRADE elements whose 'origin' attribute equals 'COMPUTER'.

But I need to handle XML files of up to 1 GB. In that case the script fails with a segmentation fault because it consumes a huge amount of memory.

To resolve this, I am trying to use the 'purge' feature of XML::Twig.

So I modified the code to:

#!/usr/bin/perl

    use strict;
    use warnings;
    use XML::Twig;


    my $twig = XML::Twig->new( twig_handlers => { TRADE => \&TRADE } );

    $twig->parsefile('1510.xml');

    $twig->set_pretty_print('indented');

    $twig->print_to_file('out.xml');

    sub TRADE {
        my ( $twig, $TRADE ) = @_;
        # cut every TRADE whose origin is not COMPUTER (cut takes no argument)
        $TRADE->cut unless
            $TRADE->att('origin') eq "COMPUTER";

        # newly added: free the memory used by everything parsed so far
        $twig->purge;
    }

This gives me an empty output file. My intention is to free the twigs that have already been processed so that memory is used efficiently.

I don't understand why the output file is blank.

Sample XML :

<TRADEEXT>
 <TRADE origin = 'COMPUTER'/>
 <TRADE origin = 'COMP'/>
 <TRADE origin = 'COMPP'/>  
</TRADEEXT>

Expected output file:

<TRADEEXT>
 <TRADE origin = 'COMPUTER'/>
</TRADEEXT>
Jim Davis
karan arora

1 Answer

You should probably use flush (to a filehandle) instead of purge: flush outputs the twig that has been parsed so far and frees the memory, while purge only frees the memory.
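
A minimal sketch of the flush approach, assuming the TRADE elements are direct children of the root as in the sample XML. The helper name `filter_trades` and its arguments are hypothetical, chosen only for illustration; `flush`, `delete` and `att` are the usual XML::Twig methods.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: copy $in_file to the filehandle $out_fh, keeping only
# the TRADE elements whose origin attribute is COMPUTER.
sub filter_trades {
    my ( $in_file, $out_fh ) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            TRADE => sub {
                my ( $t, $trade ) = @_;
                if ( ( $trade->att('origin') // '' ) eq 'COMPUTER' ) {
                    # print everything parsed so far, then free that memory
                    $t->flush($out_fh);
                }
                else {
                    # drop the unwanted element so it is never printed
                    $trade->delete;
                }
            },
        },
    );

    $twig->parsefile($in_file);

    # print what is left of the document (at least the closing root tag)
    $twig->flush($out_fh);
    return;
}
```

To use it, open the output file as in the code further down and call `filter_trades( '1510.xml', $out )`. Because each kept TRADE is flushed as soon as it is parsed and each rejected one is deleted, the whole tree never has to be held in memory, and the input file is only read.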

That said, if all you want is to remove the TRADE elements that don't have the proper attribute, you could do something like this:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

open( my $out, '>:utf8', "out.xml") or die "cannot create output file out.xml: $!";

my $twig = XML::Twig->new( pretty_print => 'indented',
                           twig_roots => { 'TRADE[@origin != "COMPUTER"]' 
                                              => sub { $_->delete; } 
                                         },
                           twig_print_outside_roots => $out,
                         )
                            
                    ->parsefile('1510.xml');

This will leave some extra empty lines in the file; you can remove them later. The twig_roots handler is triggered only for the elements you need to remove, and it deletes them, while the twig_print_outside_roots option causes everything else to be printed as is to the $out filehandle.

toolic
mirod
  • Great insight. But my requirement is that the original file should remain untouched. The output file should have all legitimate TRADEs. Can you please advise? Also, I don't want any output to appear on screen. That's why I used purge. – karan arora Feb 09 '15 at 15:56
  • I think that's what the code I posted does. Did you try it? – mirod Feb 09 '15 at 17:29
  • Your code gives the right output. Sorry for being a novice, but can you please tell me why my purge is not working? – karan arora Feb 09 '15 at 17:45
  • from the docs, `purge`: *...deletes all elements that have been completely parsed so far.* so all the TRADE elements that you have kept are purged from the memory. You really need to use `flush`, likely to a filehandle, if you want to output the TRADE elements that you want to keep. – mirod Feb 09 '15 at 17:53
  • One question: how can I add one more condition at the twig_roots level? For example, apart from the origin not being equal to COMPUTER, as mentioned above, say there is one more condition: the `event` tag inside the `TRADE` tag should have a specific attribute value. How do I do that? – karan arora Feb 10 '15 at 18:16
  • You want to delete only the `TRADE` elements that include an `event` element with a specific attribute value? Then in the handler do `if( $_->descendants( 'event[@att="val"]')) { $_->delete; } else { $_->flush( $out); }` – mirod Feb 10 '15 at 18:26
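
The combined condition from the last two comments could be sketched as below. This is only an illustration under stated assumptions: it uses twig_handlers with flush rather than the twig_roots setup from the answer, the names `is_unwanted` and `filter_trades_by_event` are hypothetical, and `event`, `att` and `val` are the placeholders used in the comment, not real names from the asker's data.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical predicate: a TRADE is unwanted when its origin is not COMPUTER
# AND it contains an event element whose att attribute equals val.
sub is_unwanted {
    my ($trade) = @_;
    return 0 if ( $trade->att('origin') // '' ) eq 'COMPUTER';
    return defined $trade->first_descendant('event[@att="val"]') ? 1 : 0;
}

# Hypothetical helper: copy $in_file to $out_fh without the unwanted TRADEs.
sub filter_trades_by_event {
    my ( $in_file, $out_fh ) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            TRADE => sub {
                my ( $t, $trade ) = @_;
                if ( is_unwanted($trade) ) {
                    $trade->delete;        # never printed
                }
                else {
                    $t->flush($out_fh);    # print and free what was parsed
                }
            },
        },
    );

    $twig->parsefile($in_file);
    $twig->flush($out_fh);    # print the remainder of the document
    return;
}
```

Keeping the test in its own subroutine means further conditions can be added in one place without touching the parsing code.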