
One can use lxml to validate XML files against a given XSD schema.

Is there a way to apply this validation in a less strict sense, ignoring all elements which contain special expressions?

Consider the following example: Say, I have an xml_file:

<foo>99</foo>
<foo>{{var1}}</foo>
<foo>{{var2}}</foo>
<foo>999</foo>

Now, I run a program on this file, which replaces the {{...}}-expressions and produces a new file:

xml_file_new:

<foo>99</foo>
<foo>23</foo>
<foo>42</foo>
<foo>999</foo>

So far, I can use lxml to validate the new XML file as follows:

from lxml import etree
xml_root = etree.parse(xml_file_new)
xsd_root = etree.parse(xsd_file)
schema = etree.XMLSchema(xsd_root)
schema.validate(xml_root)

The relevant point in my example is that the schema restricts the <foo> contents to integers.

The schema cannot be applied to the old xml_file as it stands, because the {{...}}-expressions are not integers. However, since my program also performs other expensive tasks, I would like to validate the old file in advance, ignoring all lines that contain {{...}}-expressions:

<foo>99</foo>       <!-- should be checked-->
<foo>{{var1}}</foo> <!-- should be ignored -->
<foo>{{var2}}</foo> <!-- should be ignored -->
<foo>999</foo>      <!-- should be checked-->

EDIT: Possible solution approach: One idea would be to define two schemas

  • a strict second schema for the new file, allowing only integers
  • a relaxed schema for the old file, allowing both integers and arbitrary strings with {{..}}-expressions

However, to avoid the redundant task of keeping two schemas synchronized, one would need a way to generate the relaxed schema from the strict one automatically. This sounds quite promising, as both schemas have the same structure and differ only in the restrictions on certain element contents. Is there a simple concept offered by XSD that allows one to "inherit" from one schema and then "attach" additional relaxations to individual elements?

Meyer
flonk
  • I don't think that is possible without changing the XML or the schema. Since you can't change the XML, are you open to changing the schema? Because you could define a union type for `<foo>`, which allows *either* integers *or* `{{var...}}`. – Meyer Dec 21 '16 at 19:04
  • Thanks, I already do it this way. However, I still want to use the strict integer check *after* the substitutions and forbid `{{...}}`-expressions. The first check is just to pre-detect problems and save time; the second check is the one that matters. Making sure that `{{var1}}` is replaced by an integer in the new file also requires ruling out cases where `{{var1}}` is replaced by `{{var3}}` in the *new* file. Here, using the same schema for both checks would give a false positive. – flonk Dec 21 '16 at 20:36
  • On the other hand, I fear that using two schemas (a relaxed first one and a stricter second one) leads to much redundancy, especially as I need to change and update the schema very frequently. – flonk Dec 21 '16 at 20:39
  • @Meyer Please see my edit. – flonk Dec 21 '16 at 20:52

2 Answers


To answer the edited question, it is possible to compose schemas with the xs:include (and xs:import) mechanism. This way, you can declare common parts in a common schema for reuse, and use dedicated schemas for specialized type definitions, like so:

The common schema that describes the structure. Note that it uses FooType, but does not define it:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Example structure -->
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

The relaxed schema to validate before the replacement. It includes the components from the common schema, and defines a relaxed FooType:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:include schemaLocation="common.xsd"/>

  <xs:simpleType name="FooType">
    <xs:union memberTypes="xs:integer">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="\{\{.*\}\}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

</xs:schema>

The strict schema to validate after the replacement. It defines the strict version of FooType:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:include schemaLocation="common.xsd"/>

  <xs:simpleType name="FooType">
     <xs:restriction base="xs:integer"/>
  </xs:simpleType>

</xs:schema>
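As a rough sketch of how the three documents above might be used from lxml: the snippet below writes them to a temporary directory (so that the relative `schemaLocation="common.xsd"` can be resolved) and validates a document before and after substitution. The file names `relaxed.xsd` and `strict.xsd`, the helper `load_schemas`, and the sample documents are assumptions made for illustration, not part of the original answer.

```python
import tempfile
from pathlib import Path

from lxml import etree

XS_OPEN = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'

# The three schema documents from the answer, keyed by (assumed) file name.
SCHEMAS = {
    "common.xsd": XS_OPEN + """
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>""",
    "relaxed.xsd": XS_OPEN + r"""
  <xs:include schemaLocation="common.xsd"/>
  <xs:simpleType name="FooType">
    <xs:union memberTypes="xs:integer">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="\{\{.*\}\}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>
</xs:schema>""",
    "strict.xsd": XS_OPEN + """
  <xs:include schemaLocation="common.xsd"/>
  <xs:simpleType name="FooType">
    <xs:restriction base="xs:integer"/>
  </xs:simpleType>
</xs:schema>""",
}


def load_schemas(directory):
    """Write all schema files, then compile the relaxed and the strict one."""
    for name, text in SCHEMAS.items():
        (directory / name).write_text(text, encoding="utf-8")
    # Parsing from a file path gives lxml a base URL for resolving xs:include.
    relaxed = etree.XMLSchema(etree.parse(str(directory / "relaxed.xsd")))
    strict = etree.XMLSchema(etree.parse(str(directory / "strict.xsd")))
    return relaxed, strict


with tempfile.TemporaryDirectory() as tmp:
    relaxed_schema, strict_schema = load_schemas(Path(tmp))

# Hypothetical documents before and after the {{...}} substitution.
old_doc = etree.fromstring("<root><foo>99</foo><foo>{{var1}}</foo></root>")
new_doc = etree.fromstring("<root><foo>99</foo><foo>23</foo></root>")

print("old, relaxed:", relaxed_schema.validate(old_doc))
print("old, strict: ", strict_schema.validate(old_doc))
print("new, strict: ", strict_schema.validate(new_doc))
```

The structure lives only in `common.xsd`; only the two `FooType` definitions differ between the relaxed and the strict document.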

For completeness' sake, there are also alternative ways to do this, for example with xs:redefine (XSD 1.0) or xs:override (XSD 1.1). But these have more complex semantics, and personally I try to avoid them.

Stefano Munarini
Meyer
  • Thanks for your explicit example. I think it does not yet solve the redundancy, as the strict restriction of `<foo>` to `xs:integer` should only happen *once*. Can your solution be improved in this sense? I am thinking of something like "attaching" relaxations to the strict schema. I am quite unfamiliar with XSD, but in terms of inheritance, I would rather *derive* `relaxed` from `strict` instead of deriving both from an abstract base `common`. – flonk Dec 22 '16 at 08:53
  • What I like about your answer is that it already achieves a clear separation between the restrictions of structure and contents. – flonk Dec 22 '16 at 08:54
  • @flonk, the problem is that [simple types](http://www.w3.org/TR/xmlschema-1/#declare-datatype) can only be restricted, not extended. So you cannot go from strict to relaxed. – Meyer Dec 22 '16 at 09:47
  • OK, I see. My point was to avoid the common base and instead have a *direct* inheritance to avoid repeating the integer-restriction, because this is a common property of both schemes. So in my phrase "going from strict to relaxed" the direction was not relevant, "going from relaxed to strict" would also solve my problem. So if you say the latter would be possible, could you derive the strict from the relaxed scheme by just adding the `\{\{.*\}\}`-restriction? – flonk Dec 22 '16 at 10:45
  • The other way around is difficult as well. AFAIK, it is not possible to restrict a union type back to its individual components. One possibility is to use a regular expression instead of a union; then you could restrict that. But any restriction will again contain a repetition of the integer type. I will add another answer with an alternative solution. – Meyer Dec 22 '16 at 12:23

Just with plain XSD, I do not know any way to avoid a redundant declaration of the integer type. However, as an alternative, you could adjust the schema within Python.

A simple way is this, using only one schema document (relaxed as default):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="FooType">
    <xs:union memberTypes="xs:integer">
      <xs:simpleType id="RELAXED">
        <xs:restriction base="xs:string">
          <xs:pattern value="\{\{.*\}\}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

</xs:schema>

In Python, you can then simply remove the element with id="RELAXED" to create the strict schema:

from lxml import etree

xsd_tree = etree.parse("relaxed.xsd")
xml_tree = etree.parse("test.xml")

# Create default relaxed schema
relaxed_schema = etree.XMLSchema(xsd_tree)

# Remove RELAXED element to create strict schema
pattern = xsd_tree.find(".//*[@id='RELAXED']")
pattern.getparent().remove(pattern)
strict_schema = etree.XMLSchema(xsd_tree)

print("Relaxed:", relaxed_schema.validate(xml_tree))
print("Strict:", strict_schema.validate(xml_tree))

Of course, with Python you could do this in many different ways. For example, you could also dynamically generate an xs:union element and insert it into a strict version of the schema. But that will get a lot more complex.
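For illustration, the reverse direction (starting from a strict schema and generating the relaxed one) might look roughly like the sketch below. The helper `relax`, the inline `STRICT_XSD` document, and the sample documents are hypothetical; the code also assumes that the schema binds the prefix `xs` to the XML Schema namespace and that the types to relax are top-level named simple types.

```python
import copy

from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"

# Assumed strict schema: structure plus an integer-only FooType.
STRICT_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="FooType">
    <xs:restriction base="xs:integer"/>
  </xs:simpleType>
</xs:schema>"""


def relax(strict_root, type_names, pattern=r"\{\{.*\}\}"):
    """Return a copy of the schema in which each named simple type
    becomes a union of its strict definition and a placeholder pattern."""
    relaxed_root = copy.deepcopy(strict_root)  # leave the strict tree untouched
    for simple_type in list(relaxed_root.iterfind(f"{{{XS}}}simpleType")):
        if simple_type.get("name") not in type_names:
            continue
        strict_children = list(simple_type)
        union = etree.SubElement(simple_type, f"{{{XS}}}union")
        # First member: the original strict definition, moved into an
        # anonymous simple type.
        strict_member = etree.SubElement(union, f"{{{XS}}}simpleType")
        for child in strict_children:
            strict_member.append(child)
        # Second member: strings matching the placeholder pattern.
        # "xs:string" relies on the assumed xs prefix binding.
        placeholder = etree.SubElement(union, f"{{{XS}}}simpleType")
        restriction = etree.SubElement(
            placeholder, f"{{{XS}}}restriction", base="xs:string"
        )
        etree.SubElement(restriction, f"{{{XS}}}pattern", value=pattern)
    return relaxed_root


strict_root = etree.fromstring(STRICT_XSD)
strict_schema = etree.XMLSchema(strict_root)
relaxed_schema = etree.XMLSchema(relax(strict_root, {"FooType"}))

old_doc = etree.fromstring("<root><foo>99</foo><foo>{{var1}}</foo></root>")
print("Relaxed:", relaxed_schema.validate(old_doc))
print("Strict:", strict_schema.validate(old_doc))
```

With this variant, the integer restriction is written only once, in the strict schema, and the list of type names passed to `relax` decides which element contents may hold placeholders.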

Meyer