
I need to use the Schema option of the Perl module XML::LibXML::Reader to validate a large (>1GB) XML file as the file is parsed.

Previously I used the xmllint command to validate an XML file against a given schema (xsd) file. However, I now have some large XML files to validate and am running out of memory (8GB) when trying to perform the validation.

I have read on the XML::LibXML::Reader Perl module page that there is a Schema option. However, when I use it (see code below), the code exits as soon as the first invalid element of the XML file is found.

use strict;
use warnings;
use XML::LibXML::Reader;

my $SchemaFile='schema.xsd';
my $FileToAnalyse='/tmp/file.xml';

my $reader = XML::LibXML::Reader->new(location => $FileToAnalyse,Schema=>$SchemaFile) or 
die "cannot read file '$FileToAnalyse': $!\n";

while($reader->read) {

    # Process the file node by node here, even if it is not valid against
    # the schema (streaming reduces memory usage for large files)
}

I need to collect the invalid entries and continue rather than exiting. Is this possible?

Chazg76
    Swallowing the XML::LibXML::Error exception appears to put `$reader` into an invalid state. The [spec](https://w3.org/TR/xml/#sec-terminology) says the parser MAY continue. – daxim Oct 18 '19 at 10:38
    Try this tutorial https://culturedperl.com/perl-5-xml-validation-with-dtd-and-xsd-ec2d90f7c434 – Dragos Trif Oct 20 '19 at 18:55

2 Answers


The reason $reader->read does not recover from schema validation errors (even if recovery could be possible) can be seen at line #8815 of LibXML.xs. Notice that REPORT_ERROR() is called with a zero value. The value indicates whether `LibXML_report_error_ctx()` will be able to recover from errors or not: a value of zero means it will not try to recover and will call XML::LibXML::Error::_report_error to die.

I tried to change the value to 1 at line #8815 and recompiled the XS module, and now it reported the schema errors as warnings (instead of dying) and continued the parsing.

I guess there is a good reason why this option is not made available to the user, but I am not familiar enough with XML parsing to give an example of what could go wrong here.

Edit:

It seems that the correct approach is to catch the exceptions thrown by read() and then call read() again. If the following call to read() returns -1, the parser was not able to recover from the error; if it returns 0, end-of-file was reached; and if it returns 1, it was able to recover from the exception. I did some testing, and it seems to be able to recover from schema validation errors but not from parsing errors. So you could try the following:

use feature qw(say);
use strict;
use warnings;

use Try::Tiny qw(try catch);
use XML::LibXML::Reader;

my $SchemaFile='schema.xsd';
my $FileToAnalyse='file.xml';
my $reader = XML::LibXML::Reader->new(
    location => $FileToAnalyse, Schema => $SchemaFile
) or die "cannot read file '$FileToAnalyse': $!\n";
while (1) {
    my $result;
    try { $result = $reader->read } catch {
        say '==> ' . $_;
        $result = 1;  # Try to continue after exception..
    };
    last if $result != 1;
    if ( $reader->nodeType == XML_READER_TYPE_ELEMENT ) {
        say "Element node: ", $reader->name;
    }
}
$reader->finish();
$reader->close();
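
Since the question asks to collect the invalid entries rather than print them, the same loop could store each exception in an array. The following is only a minimal sketch under the assumptions stated above (that read() can be called again after a schema validation error). The helper name `collect_schema_errors` and the `$max_errors` safety cap are hypothetical and not part of XML::LibXML; the cap is there only to avoid looping forever if the reader keeps throwing after an unrecoverable parsing error.

```perl
use strict;
use warnings;

use Try::Tiny qw(try catch);
use XML::LibXML::Reader;

# Hypothetical helper (not part of XML::LibXML): streams through $xml_file,
# validating against $xsd_file, and returns the error messages it caught.
sub collect_schema_errors {
    my ($xml_file, $xsd_file, $max_errors) = @_;
    $max_errors //= 1000;    # assumed safety cap, adjust as needed

    my $reader = XML::LibXML::Reader->new(
        location => $xml_file, Schema => $xsd_file
    ) or die "cannot read file '$xml_file': $!\n";

    my @errors;
    while (@errors < $max_errors) {
        my $result;
        my $threw = 0;
        try { $result = $reader->read } catch {
            push @errors, "$_";    # stringify the XML::LibXML::Error object
            $threw = 1;
        };
        next if $threw;            # try the next read() after an exception
        last if $result != 1;      # 0 = end of file, -1 = could not recover
        # Per-node processing would go here.
    }
    $reader->close;
    return @errors;
}

# Usage: perl script.pl file.xml schema.xsd
if (@ARGV == 2) {
    print "$_\n" for collect_schema_errors(@ARGV);
}
```

A flag is used instead of calling `next` inside the `catch` block because Try::Tiny blocks are anonymous subs, so loop control from inside them does not behave like it would in a plain block.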
Håkon Hægland

OK, this is not exactly what I originally asked, but I have found a solution if anyone is interested. I have simply used the --stream switch for the xmllint command.

This allows me to validate XML files >1GB on a system with 4GB of RAM (without the --stream switch this was not possible). The method generates a list of the entries, if any, that do not conform to the supplied XSD file (these can be written to a file or the terminal). The important point for me is that xmllint does not stop when it finds the first non-conformity but rather continues to the end of the XML file, printing any non-conformities as it goes.

Chazg76
    I worked with a dev team who were implementing a validating XML parser. When I asked for a 'warn only' feature, they explained that it is not possible, in general, to provide this except in specific cases. Once you depart from the 'grammar' of the schema, it is often unclear how the XML processor should continue. The exception is validating XSD facets on attribute values - those errors are always survivable because they only involve the value of a single attribute. – kimbert Oct 23 '19 at 13:03