
I have a task to correct the syntax of XML files that are not well-formed. Currently I do this manually. Is there a way to check whether an XML file is well-formed and, if it is not, correct it automatically?

Is it possible to check and correct the files with a Perl script?

Thanks,

MangeshBiradar
  • There is no algorithm that could reliably take arbitrary non-well-formed XML and produce well-formed XML that reflected the original intent. It is easy to construct non-well-formed XML that could have many different interpretations... how does the correction code decide which one is right? What you are asking for is equivalent to writing a Java (or Perl) compiler that accepts invalid code and "corrects" it. If we had that, syntax errors would be a thing of the past. – Jim Garrison Feb 12 '13 at 06:32
  • Thanks Jim. That makes sense. Can't we check whether the opening and closing tags are correct and, if not, correct them? – MangeshBiradar Feb 12 '13 at 06:36
  • Check whether what...? – Jim Garrison Feb 12 '13 at 06:38
  • @Maverick143 — No. Reread Jim's original comment again. – Quentin Feb 12 '13 at 07:19
  • @JimGarrison: On the other hand, all significant web browsers make a good attempt to correct malformed XHTML. – Borodin Feb 12 '13 at 08:15

2 Answers


XML::LibXML is a parser that can also validate. You can use it to determine whether the XML is well-formed: parsing fails with an exception if it is not.

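use strict;
use warnings;
use XML::LibXML qw( );

my $qfn = shift(@ARGV) or die "usage: $0 file.xml\n";   # file to check
my $parser = XML::LibXML->new();
# parse_file dies if the document is not well-formed; eval traps the error in $@
if (eval { $parser->parse_file($qfn) }) {
   print "ok\n";
} else {
   print "error:\n$@";
}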

Automatically correcting XML is another matter. It's impossible to automatically fix bad XML without making huge assumptions. For example, there's no way to know whether

<foo>/bar<baz/</foo>

was meant to be

<foo>/bar&lt;baz/</foo>

or

<foo>/bar<baz/></foo>

or even something else.
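To make the ambiguity concrete, here is a small sketch that parses both candidate fixes and compares the resulting trees. The helper name describe_root is made up for this illustration. The first candidate has no child elements under foo, while the second has one, so the two "corrections" are different documents.

```perl
use strict;
use warnings;
use XML::LibXML qw( );

# Hypothetical helper: report the number of child elements of the root
# element and the root's text content.
sub describe_root {
   my ($xml) = @_;
   my $doc = XML::LibXML->load_xml( string => $xml );
   my @elements = $doc->findnodes('/*/*');   # element children of the root
   return ( scalar(@elements), $doc->documentElement->textContent );
}

# The two candidate fixes from above.
for my $candidate ( '<foo>/bar&lt;baz/</foo>', '<foo>/bar<baz/></foo>' ) {
   my ($count, $text) = describe_root($candidate);
   print "$candidate: $count child element(s), text '$text'\n";
}
```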

XML::LibXML does have an option to automatically fix or ignore some errors, but there is no guarantee that it makes the same assumptions you would. Use

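use strict;
use warnings;
use XML::LibXML qw( );
my ($recover, $in_qfn, $out_qfn) = @ARGV;   # recover level (1 or 2), input file, output file
my $parser = XML::LibXML->new( recover => $recover );
my $doc = $parser->parse_file($in_qfn);
$doc->toFile($out_qfn);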

Use 1 for $recover if you want the parser to warn when it fixes a problem.
Use 2 for $recover if you want the parser to fix problems silently.
No matter what you use for $recover, it will still throw an exception if it encounters an unrecoverable error.
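If you want to try recovery on a string before touching any files, something like the following sketch might help. The helper name repair_xml and the sample input are made up for illustration; the only library features used are load_xml and the recover option described above. Always compare the output against what you intended, since the parser's guess may differ from yours.

```perl
use strict;
use warnings;
use XML::LibXML qw( );

# Hypothetical helper: try to repair an XML string using recovery mode.
# Returns the re-serialized document, or undef if the parser gave up
# or produced a document without a root element.
sub repair_xml {
   my ($xml) = @_;
   my $doc = eval { XML::LibXML->load_xml( string => $xml, recover => 2 ) };
   return undef if !$doc || !$doc->documentElement;
   return $doc->toString();
}

my $broken = '<foo><bar>text</foo>';   # made-up input: <bar> is never closed
my $fixed  = repair_xml($broken);
print defined($fixed) ? $fixed : "could not recover\n";
```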

ikegami

You could try XML::Liberal: "Super liberal XML parser that parses broken XML", and see if it works for you.
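
A minimal sketch of how that might look, based on the module's documented synopsis (a constructor taking 'LibXML' and a parse_string method). The helper name liberal_parse and the sample input are made up, and I have not checked which kinds of breakage the module repairs for your files, so treat the result as a guess to be reviewed.

```perl
use strict;
use warnings;
use XML::Liberal;

# Ask XML::Liberal for a parser built on top of XML::LibXML.
my $parser = XML::Liberal->new('LibXML');

# Hypothetical helper: return the serialized document, or undef on failure.
sub liberal_parse {
   my ($xml) = @_;
   my $doc = eval { $parser->parse_string($xml) };
   return $doc ? $doc->toString() : undef;
}

# Made-up input: a bare "&" is not allowed in well-formed XML.
my $result = liberal_parse('<title>Tom & Jerry</title>');
print defined($result) ? $result : "still could not parse:\n$@";
```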

mirod