I have legacy XML documents that contain nested (non-root) elements that I want to validate against an XML Schema. The schema itself does not describe the XML document as a whole, but only a particular nested element.

The XML document is a message received from a 3rd party system. It has no xmlns attributes and not even an XML declaration. It's a legacy thing that I cannot influence. Example:

<XM>
    <MH> … nested header elements … </MH>
    <MD>
        <RECSET>
            … payload elements go here …
        </RECSET>
    </MD>
</XM>

My aim is to validate /XM/MD/RECSET against an XSD which defines the RECSET element and any payload elements nested within. I do not have schemas that would describe the outer elements, i.e. XM, MH, MD. I could modify all existing schemas and add dummy definitions, e.g. allowing for xs:all, but that is not preferred.

The validation is an optional step in a processing pipeline, and I want to avoid unnecessarily repeated XML parsing and other processing which adds execution time (throughput is important).

Another constraint is that I want to use XmlDocument, because down the processing pipeline I need an XmlDocument instance to perform deserialization into an object model using XmlSerializer. Again, this is an existing solution that I want to preserve.

My attempt is as follows:

// build an XmlDocument instance as the intermediate format of the message
var xml = new XmlDocument();
xml.LoadXml(msg.TransportMessage);

// obtain a pre-cached XmlSchemaSet instance matching the message represented by XmlDocument
XmlSchemaSet schemaSet = … ;

// find the whole payload represented by the RECSET element
var nodeToValidate = xml.SelectSingleNode("/XM/MD/RECSET");

// attach schemas to the document and validate the payload node
xml.Schemas = schemaSet;
xml.Validate(ValidationCallback, nodeToValidate);

This results in an error:

Schema information could not be found for the node passed into Validate. The node may be invalid in its current position. Navigate to the ancestor that has schema information, then call Validate again.

I've looked into the implementation of XmlDocument and the DocumentSchemaValidator class, which, in case of specific node validation, searches the DOM for schema information. Hence I tried attaching a reference to the correct schema to the node ad hoc:

XmlAttribute noNamespaceAttribute = xml.CreateAttribute("xsi:noNamespaceSchemaLocation", "http://www.w3.org/2001/XMLSchema-instance");
foreach (XmlSchemaElement x in schemaSet.GlobalElements.Values)
{
    if (x.Name == "RECSET")
    {
        noNamespaceAttribute.InnerText = x.SourceUri!;
        break;
    }
}
nodeToValidate.Attributes!.Append(noNamespaceAttribute);

However, that results in the very same error message.

A working way to achieve such validation is to take the nodeToValidate.OuterXml and parse it either using a validating XmlReader or a new XmlDocument instance. However, that leads to another overhead in terms of memory and CPU. I'd rather avoid this route.
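For reference, the re-parsing route could look roughly like the sketch below. The ReparseValidation class and its Validate method are names made up for illustration, and errors are collected in a list instead of going through my ValidationCallback. The extra cost comes from serializing the node with OuterXml and parsing that string a second time.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper, for illustration only.
public static class ReparseValidation
{
    // Serializes the node to a string and parses it again with a
    // schema-validating XmlReader. Returns the validation messages.
    public static List<string> Validate(XmlNode node, XmlSchemaSet schemaSet)
    {
        var errors = new List<string>();
        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = schemaSet
        };
        settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

        // node.OuterXml allocates a new string; the reader then re-parses it.
        using (var reader = XmlReader.Create(new StringReader(node.OuterXml), settings))
        {
            while (reader.Read()) { }
        }
        return errors;
    }
}
```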

Is there a way to tell the validation engine to validate a particular node against an explicitly specified schema?

Ondrej Tucny
  • Is the schema embedded in the XML file? in some file on disk? Available via the web? – dbc Sep 06 '22 at 16:14
  • According to the docs for [`XmlDocument.Validate(ValidationEventHandler validationEventHandler, XmlNode nodeToValidate)`](https://docs.microsoft.com/en-us/dotnet/api/system.xml.xmldocument.validate?view=net-6.0#system-xml-xmldocument-validate(system-xml-schema-validationeventhandler-system-xml-xmlnode)): *The Validate method performs infoset augmentation. Specifically, after successful validation, schema defaults are applied, text values are converted to atomic values as necessary, and type information is associated with validated information items.* Do you need **infoset augmentation**? – dbc Sep 06 '22 at 16:27
  • If you don't need to do infoset augmentation and only need to validate, you can do so using an `XmlNodeReader` created from `nodeToValidate`, see my [mcve] here: https://dotnetfiddle.net/Zdi3Fu and fixed using `XmlNodeReader` here: https://dotnetfiddle.net/NFuK2t. Please [edit] your question to clarify your requirements. – dbc Sep 06 '22 at 18:25
  • @dbc No, schemas are in external files. I'm building an `XmlSchemaSet` instance separately. The doc itself is 'plain XML'. The second dotnetfiddle seems like the way to go! – Ondrej Tucny Sep 06 '22 at 19:43

1 Answer

Your problem is that XmlDocument.Schemas is intended to represent the schema for the entire document:

The schemas contained in an XmlSchemaSet object associated with an XmlDocument object are used for validation when the Validate method of an XmlDocument is executed.

In your case there is no schema for the entire document. When you set XmlDocument.Schemas to the schema for a child element and then ask to validate only that node, validation fails, most likely because the validation code has to work its way down from the root element, for which no schema information exists, before it can reach the declaration of the child element to be checked.

Options for a workaround depend on what you are trying to accomplish when you call XmlDocument.Validate(ValidationEventHandler, XmlNode). As explained in the docs, this method actually performs two distinct but related actions:

  1. As expected, it validates the XML data in the XmlNode object against the schemas contained in the Schemas property.

  2. It also performs infoset augmentation:

    The Validate method performs infoset augmentation. Specifically, after successful validation, schema defaults are applied, text values are converted to atomic values as necessary, and type information is associated with validated information items. The result is a previously un-typed XML sub-tree in the XmlDocument replaced with a typed sub-tree.

Action #1 seems clear, but what exactly is infoset augmentation? This isn't clearly documented, but one effect is to populate XmlNode.SchemaInfo. For instance, taking the XML and XSD from https://www.w3schools.com/xml/schema_example.asp, I can validate the root element against the XSD and print DocumentElement.SchemaInfo before and after:

var nodeToValidate = xml.DocumentElement;

Console.WriteLine("DocumentElement.SchemaInfo before: {0}", new { nodeToValidate?.SchemaInfo.IsDefault,  nodeToValidate?.SchemaInfo.IsNil, nodeToValidate?.SchemaInfo.MemberType, nodeToValidate?.SchemaInfo.SchemaAttribute, nodeToValidate?.SchemaInfo.SchemaElement, nodeToValidate?.SchemaInfo.SchemaType, nodeToValidate?.SchemaInfo.Validity });

// attach schemas to the document and validate the root element
xml.Schemas = schemaSet;
xml.Validate(ValidationCallback, nodeToValidate);

Console.WriteLine("DocumentElement.SchemaInfo after:  {0}", new { nodeToValidate?.SchemaInfo.IsDefault,  nodeToValidate?.SchemaInfo.IsNil, nodeToValidate?.SchemaInfo.MemberType, nodeToValidate?.SchemaInfo.SchemaAttribute, nodeToValidate?.SchemaInfo.SchemaElement, nodeToValidate?.SchemaInfo.SchemaType, nodeToValidate?.SchemaInfo.Validity });

The output shows that validation has populated DocumentElement.SchemaInfo:

DocumentElement.SchemaInfo before: { IsDefault = False, IsNil = False, MemberType = , SchemaAttribute = , SchemaElement = , SchemaType = , Validity = NotKnown }
DocumentElement.SchemaInfo after:  { IsDefault = False, IsNil = False, MemberType = , SchemaAttribute = , SchemaElement = System.Xml.Schema.XmlSchemaElement, SchemaType = System.Xml.Schema.XmlSchemaComplexType, Validity = Valid }

Demo fiddle #1 here.

Further, it seems that XmlDocument.Validate(ValidationEventHandler, XmlNode) may actually insert additional nodes into the document; see "XmlDocument.NodeInserted triggered on XmlDocument.Validate()" for one such example.

But do you really need to modify your XmlDocument via infoset augmentation, or do you just need to perform a read-only validation?

If you don't need infoset augmentation, you may validate an XmlNode by constructing an XmlNodeReader from it and then using the reader for read-only validation. First introduce the following extension methods:

using System;
using System.Xml;
using System.Xml.Schema;

public static class XmlNodeExtensions
{
    public static void Validate(this XmlNode node, XmlReaderSettings settings)
    {
        if (node == null)
            throw new ArgumentNullException(nameof(node));
        using (var innerReader = new XmlNodeReader(node))
        using (var reader = XmlReader.Create(innerReader, settings))
        {
            while (reader.Read())
                ;
        }
    }

    public static void Validate(this XmlNode node, XmlSchemaSet schemaSet, XmlSchemaValidationFlags validationFlags, ValidationEventHandler validationEventHandler)
    {
        if (node == null)
            throw new ArgumentNullException(nameof(node));
        var settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.ValidationFlags |= validationFlags;
        if (validationEventHandler != null)
            settings.ValidationEventHandler += validationEventHandler;
        settings.Schemas = schemaSet;
        node.Validate(settings);
    }
}

And now you will be able to do:

XmlSchemaSet schemaSet = ...

var nodeToValidate = xml.SelectSingleNode("/XM/MD/RECSET");

nodeToValidate.Validate(schemaSet, default, ValidationCallback);

Note that you never set XmlDocument.Schemas with this approach. Demo fiddle #2 here.
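To see the pieces working together, here is a minimal self-contained sketch of the same XmlNodeReader idea. The inline XSD, the REC payload element, and the NestedNodeValidationDemo class are made up for illustration and are not taken from your actual schemas. The sketch collects error messages in a list rather than calling your ValidationCallback.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical demo class, for illustration only.
public static class NestedNodeValidationDemo
{
    // Assumed schema: describes only RECSET and its payload, not XM/MH/MD.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='RECSET'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='REC' type='xs:int' maxOccurs='unbounded'/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    public static List<string> ValidateRecset(string messageXml)
    {
        // In a real pipeline the schema set would be built once and cached.
        var schemaSet = new XmlSchemaSet();
        using (var xsdReader = XmlReader.Create(new StringReader(Xsd)))
            schemaSet.Add(null, xsdReader);

        var xml = new XmlDocument();
        xml.LoadXml(messageXml);
        XmlNode node = xml.SelectSingleNode("/XM/MD/RECSET");
        if (node == null)
            throw new InvalidOperationException("RECSET element not found.");

        var errors = new List<string>();
        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = schemaSet
        };
        settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

        // The XmlNodeReader walks the existing DOM subtree; the wrapping
        // reader validates what it sees, with RECSET acting as the root.
        using (var nodeReader = new XmlNodeReader(node))
        using (var reader = XmlReader.Create(nodeReader, settings))
        {
            while (reader.Read()) { }
        }
        return errors;
    }
}
```

Calling ValidateRecset with a message whose REC values are integers should return an empty list, while a non-numeric REC should produce at least one message. The validator never sees the outer XM, MH and MD elements, so no schema is needed for them.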

If you do need infoset augmentation, you will need to rethink your approach, possibly by programmatically generating a plausible XmlSchema for the <XM><MD>...</MD></XM> wrapper elements at runtime and adding it to XmlDocument.Schemas before validation.

dbc