9

According to the lxml documentation "The DTD is retrieved automatically based on the DOCTYPE of the parsed document. All you have to do is use a parser that has DTD validation enabled."

http://lxml.de/validation.html#validation-at-parse-time
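For reference, here is a minimal sketch of the DTD behaviour that the documentation describes. It uses an internal DTD subset so that it is self-contained; the sample documents and the helper name `is_dtd_valid` are made up for illustration.

```python
from lxml import etree

# A parser with DTD validation enabled; the DTD is taken from the DOCTYPE.
parser = etree.XMLParser(dtd_validation=True)

VALID = b"""<!DOCTYPE name [
  <!ELEMENT name (#PCDATA)>
]>
<name>foo</name>"""

# Same DTD, but the content model (#PCDATA) does not allow a child element.
INVALID = b"""<!DOCTYPE name [
  <!ELEMENT name (#PCDATA)>
]>
<name><child/></name>"""


def is_dtd_valid(data):
    """Return True if `data` parses and validates against its own DTD."""
    try:
        etree.fromstring(data, parser)
        return True
    except etree.XMLSyntaxError:
        # lxml reports DTD validation failures at parse time this way.
        return False
```

No schema object is passed anywhere: the parser finds the DTD through the DOCTYPE alone, which is the behaviour I am looking for with XSD.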

However, if you want to validate against an XML schema, you need to explicitly reference one.

I am wondering why this is, and would like to know whether there is a library or function that can do this, or at least an explanation of how to do it myself. The problem is that there seem to be many ways to reference an XSD, and I need to support all of them.

Validation is not the issue. The issue is how to determine the schemas to validate against. Ideally this would handle inline schemas as well.

Update:

Here is an example.

simpletest.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="name" type="xs:string"/>
</xs:schema>

simpletest.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<name xmlns="http://www.example.org"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.example.org simpletest.xsd">foo</name>

I would like to do something like the following:

>>> parser = etree.XMLParser(xsd_validation=True)
>>> tree = etree.parse("simpletest.xml", parser)
Jono
  • We can't tell you how to deal with your own formats. – Marcin Mar 23 '12 at 17:38
  • Marcin, I do not understand your comment. Perhaps I don't understand how schema validation works. – Jono Mar 23 '12 at 18:32
  • Are you doing this on Windows? AFAIK Microsoft is the only one to support inline schemas. –  Mar 23 '12 at 22:13
  • Doing this on Linux and inline is less important to me anyway. – Jono Mar 23 '12 at 22:30
  • @Jono Perhaps you don't. It might help if you asked a concrete question, rather than a completely general one. – Marcin Mar 24 '12 at 08:03
  • I'm pretty sure `lxml` doesn't support inline schemas. [Not many parsers do](http://msdn.microsoft.com/en-us/library/aa302288.aspx): "The W3C Schema Recommendation allows, but does not mandate, support for inline schemas. Few other XML Schema implementations besides those by Microsoft actually do support inline schemas." – Katriel Mar 24 '12 at 14:24
  • Given that, I'm not sure I understand the question. There's no _official_ way of giving the schema of a document. You just get it from somewhere. Sometimes it's inline, but that's not often supported. – Katriel Mar 24 '12 at 14:28
  • You can reference the schema in the XML document, but I'm not sure how that's treated because AFAIK you need a validator to actually do the validation. –  Mar 24 '12 at 15:09
  • Try the validate_xml function in the following link: [XSD Validation](http://stackoverflow.com/a/40171364/7051977) – Spithas Oct 21 '16 at 08:19

2 Answers

3

I have a project with over 100 different schemas and XML trees. To manage and validate all of them, I did a few things.

1) I created a file (e.g. xmlTrees.py) containing a dictionary for every XML document, holding its path and the schema associated with it. This gave me a single place to look up both the XML and the schema used to validate it.

MY_XML = {'url':'/pathToTree/myTree.xml', 'schema':'myXSD.xsd'}

2) The project has just as many namespaces, which are very hard to manage. So I again created a single file containing all the namespaces in the prefix-to-URI dictionary format that lxml accepts. In my tests and scripts I then always pass this superset of namespaces.

ALL_NAMESPACES = {
    'namespace1':  'http://www.example.org',
    'namespace2':  'http://www.example2.org'
}
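As a small sketch of how such a superset is used, assuming a made-up one-element document: lxml resolves the prefixes in an XPath expression through the mapping you pass, so the prefixes need not match the ones in the document, and unused entries do no harm.

```python
from lxml import etree

ALL_NAMESPACES = {
    'namespace1': 'http://www.example.org',
    'namespace2': 'http://www.example2.org',
}

# Illustrative document that uses a default namespace rather than a prefix.
doc = etree.fromstring(b'<name xmlns="http://www.example.org">foo</name>')

# 'namespace1' is looked up in ALL_NAMESPACES, not in the document itself.
matches = doc.xpath('/namespace1:name/text()', namespaces=ALL_NAMESPACES)
```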

3) For basic, generic validation I ended up writing a function I could call:

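import logging
from io import StringIO

from lxml import etree


def validateXML(content, schemaContent):
    """Validate an XML string against a schema; return a result dict.

    `schemaContent` goes straight to etree.parse(), so it must be a
    filename or a file-like object.
    """
    response = {}
    try:
        xmlSchema_doc = etree.parse(schemaContent)
        xmlSchema = etree.XMLSchema(xmlSchema_doc)
        xml = etree.parse(StringIO(content))
    except (etree.LxmlError, OSError, ValueError):
        # Unreadable file, malformed XML, or an invalid schema definition.
        message = "Could not parse schema or content to validate xml"
        logging.critical(message)
        response['valid'] = False
        response['errorlog'] = message
        return response

    # Validate the content against the schema.
    try:
        xmlSchema.assertValid(xml)
        response['valid'] = True
        response['errorlog'] = None
    except etree.DocumentInvalid:
        response['valid'] = False
        response['errorlog'] = xmlSchema.error_log

    return response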

Any function that wants to use this passes in the XML content as a string and the schema as something etree.parse() accepts, i.e. a filename or a file-like object (wrap the XSD text in StringIO if you have it as a string). This gave me the most flexibility. I then placed this function in a file with all my XML helper functions.
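A self-contained sketch of the same kind of check with in-memory documents may help. The schema and sample documents here are illustrative assumptions, not part of my project. It shows two lxml options: `XMLSchema.validate()`, which returns a boolean and fills `error_log`, and `XMLParser(schema=...)`, which validates while parsing.

```python
from lxml import etree

# Illustrative schema and documents, chosen only for this sketch.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.example.org" elementFormDefault="qualified">
  <xs:element name="name" type="xs:string"/>
</xs:schema>"""
GOOD = b'<name xmlns="http://www.example.org">foo</name>'
BAD = b'<other xmlns="http://www.example.org">foo</other>'

schema = etree.XMLSchema(etree.fromstring(XSD))

# Option 1: validate an already parsed tree; no exception is raised.
good_ok = schema.validate(etree.fromstring(GOOD))
bad_ok = schema.validate(etree.fromstring(BAD))
# error_log holds the messages from the most recent validation run.
bad_errors = [entry.message for entry in schema.error_log]

# Option 2: validate while parsing; failures surface as XMLSyntaxError.
parser = etree.XMLParser(schema=schema)
try:
    etree.fromstring(BAD, parser)
    parse_time_ok = True
except etree.XMLSyntaxError:
    parse_time_ok = False
```

Either way the schema still has to be supplied explicitly; lxml does not pick it up from the document.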

Jtello
  • This does not answer my question because you are defining an explicit mapping of XML documents to schemas. The point of my question is how one can infer mappings. – Jono Apr 04 '12 at 20:01
  • The only way to really infer the mapping is to create some type of mapping, unfortunately. Unless you can get the URL from the schema definitions and actually retrieve the XSD file, or you add a comment to each schema giving its location (which is still creating a mapping, not inferring one), you basically can't. – Jtello Apr 06 '12 at 22:39
  • Yes, in my sample above I am using schemaLocation to make a reference. But that is only one way to refer to it inline. There are many other ways to do it inline (i.e. below the root node), but I can't find a library that will parse and validate all these cases. – Jono Apr 06 '12 at 22:51
1

You could extract the schemas yourself and import them into a root schema:

from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XS = '{http://www.w3.org/2001/XMLSchema}'


SCHEMA_TEMPLATE = """<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns="http://dummy.libxml2.validator"
targetNamespace="http://dummy.libxml2.validator"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.0"
elementFormDefault="qualified"
attributeFormDefault="unqualified">
</xs:schema>"""


def validate_XML(xml):
    """Validate an XML file represented as string. Follow all schemaLocations.

    :param xml: XML represented as string.
    :type xml: str
    """
    tree = etree.XML(xml)
    schema_tree = etree.XML(SCHEMA_TEMPLATE)
    # Find all unique instances of 'xsi:schemaLocation="<namespace> <path-to-schema.xsd> ..."'
    schema_locations = set(tree.xpath("//*/@xsi:schemaLocation", namespaces={'xsi': XSI}))
    for schema_location in schema_locations:
        # Split namespaces and schema locations ; use strip to remove leading
        # and trailing whitespace.
        namespaces_locations = schema_location.strip().split()
        # Import all found namespace/schema location pairs
        for namespace, location in zip(*[iter(namespaces_locations)] * 2):
            xs_import = etree.Element(XS + "import")
            xs_import.attrib['namespace'] = namespace
            xs_import.attrib['schemaLocation'] = location
            schema_tree.append(xs_import)
    # Construct the schema
    schema = etree.XMLSchema(schema_tree)
    # Validate!
    schema.assertValid(tree)
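
The `zip(*[iter(...)] * 2)` expression is the least obvious part, so here is a small sketch of what it does. The helper name `pairs` and the sample value are made up for illustration. Because both arguments to `zip` are the same iterator, each step consumes two consecutive tokens, which turns the flat "namespace location namespace location" list into pairs.

```python
def pairs(tokens):
    """Group a flat token list into consecutive (first, second) tuples."""
    it = iter(tokens)
    # zip pulls from the same iterator twice per step; a trailing
    # unpaired token is silently dropped.
    return list(zip(it, it))


sample = "http://www.example.org simpletest.xsd http://www.example2.org other.xsd"
grouped = pairs(sample.split())
```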

BTW, your simpletest.xsd is missing the targetNamespace.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org" elementFormDefault="qualified">
    <xs:element name="name" type="xs:string"/>
</xs:schema>

With the code above, your example document validates against this schema.
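
The code above only follows xsi:schemaLocation. XML Schema also defines xsi:noNamespaceSchemaLocation for documents without a target namespace, and both attributes are only hints to the processor. A possible sketch for collecting both kinds of hint follows; the function name `collect_schema_hints` is an assumption for illustration, and how to feed the no-namespace locations into a root schema is left open.

```python
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def collect_schema_hints(root):
    """Return (namespace, location) pairs hinted anywhere in the tree.

    The namespace is None for xsi:noNamespaceSchemaLocation entries.
    """
    ns = {'xsi': XSI}
    hints = []
    for value in root.xpath("//@xsi:schemaLocation", namespaces=ns):
        tokens = value.split()
        # Even positions are namespaces, odd positions are locations.
        hints.extend(zip(tokens[0::2], tokens[1::2]))
    for value in root.xpath("//@xsi:noNamespaceSchemaLocation", namespaces=ns):
        hints.append((None, value.strip()))
    return hints
```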

Mathias Loesch