I'm using the XML::Twig module to remove all the comments from an XML file. Here is a sample file:

<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
node A content 1
<!-- One Line Comment A1-->
<![CDATA[this portion within the two comments is being
REMOVED which is not the intention]]>
<!-- Two Line Comment
Two Line Comment-->
node A content 3
<!-- Two Line Comment
Two Line Comment-->
<![CDATA[this portion within the two comments is being
REMOVED which is not the intention]]>
<!-- Two Line Comment
Two Line Comment-->
<![CDATA[
this portion is fine]]>

<Node_B> node B content
<Node_C> node c content
</Node_C>
<!-- One Line Comment -->
some data one
<!-- Multi  Line Comment
Line 3Comment
1Line Comment
2Line Comment
Line 5Comment
Line Comment-->
some data again two 
<!-- Multi  Line Comment
Line 3Comment
Line 5Comment
Line Comment-->

few more
</Node_B>

</Node_A>

I used this script:

#!/usr/bin/perl 

use strict;
use warnings;
use XML::Twig;
my $infile = 'demo.xml';
my $twig = XML::Twig->new (comments => 'drop', pretty_print => 'indented')->parsefile($infile);
$twig->print ();

This script also removes the CDATA sections that sit between two comments, which is not what I want. The output is:

<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
node A content 1

<![CDATA[
this portion is fine]]><Node_B> node B content
<Node_C> node c content
</Node_C>

some data one

some data again two 


few more
</Node_B></Node_A>

What do I need to add so that only the comments are removed, while the CDATA sections and everything else stay as they are?

Thanks in advance.

mu is too short

1 Answer

When I run your script with the demo.xml file you posted, I get the output:

<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
node A content 1

<![CDATA[this portion within the two comments is being
REMOVED which is not the intention]]>

node A content 3

<![CDATA[this portion within the two comments is being
REMOVED which is not the intention]]><![CDATA[
this portion is fine]]><Node_B> node B content
<Node_C> node c content
</Node_C>

some data one

some data again two


few more
</Node_B></Node_A>

That looks correct to me. I suspect you have a buggy version of XML::Twig (or of XML::Parser, which it depends on). I'm using Perl 5.14.2, XML::Twig 3.35, and XML::Parser 2.41.
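
To compare your setup with those version numbers, something like the following sketch should print what is installed on your machine. The `module_version` helper is a name made up for this example, not part of either module; it reports "not installed" when a module cannot be loaded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if the module cannot be loaded.
sub module_version {
    my ($module) = @_;
    return unless eval "require $module; 1";
    return $module->VERSION;
}

# Print the Perl version, then the version of each module of interest.
printf "Perl %vd\n", $^V;
for my $module (qw(XML::Twig XML::Parser)) {
    my $version = module_version($module);
    printf "%s %s\n", $module, defined $version ? $version : 'not installed';
}
```

If the XML::Twig version printed is much older than 3.35, upgrading from CPAN is probably the first thing to try.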

cjm
  • Same here, the code runs fine. I doubt it's a bug in either module though; AFAIK the code handling comments hasn't changed in years. – mirod Nov 15 '11 at 06:54
  • You are absolutely right. My version of Twig was pretty old (3.13), and now it's working as expected after installing the current version. – Soumava Roy Nov 16 '11 at 07:24
  • Could you please tell me whether the script depends on the size of the XML file? For large XML files (more than 1000 lines), will it still remove the comments? – Soumava Roy Nov 16 '11 at 07:29