
I have a simple XML file with an XSD schema, where some elements are allowed to contain only certain elements, e.g.

<xsd:element name="day" type="xsd:date"/>
<xsd:element name="interval">
    <xsd:complexType>
        <xsd:sequence>
            <xsd:element ref="day" minOccurs="2" maxOccurs="2"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:element>

and the XML code:

<interval>
    <day>2016-08-21</day>
    <day>2016-10-21</day>
</interval>

If I type anything other than whitespace or day elements inside the interval tags, the document (correctly) fails to validate. Using lxml in Python, I extracted the canonical form (C14N) of this XML and found that the whitespace (those 4 spaces of indentation) was kept, as the standard requires.

I then need to digitally sign this document, but I do not understand why anyone would sign that whitespace. It seems a weakness to me: different indentation implies different canonical XML (and therefore mismatching signatures), yet this is an unambiguous case in which the whitespace has nothing to do with the represented data (all the more so because the schema rejects any non-whitespace text there).
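The mismatch is easy to reproduce. The sketch below does not use lxml: it assumes Python 3.8+ and the standard library's `xml.etree.ElementTree.canonicalize` (C14N 2.0), with SHA-256 standing in for the digest step of a real signature. The helper name `c14n_digest` is illustrative only.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# Same data, two layouts.
indented = (
    "<interval>\n"
    "    <day>2016-08-21</day>\n"
    "    <day>2016-10-21</day>\n"
    "</interval>"
)
compact = "<interval><day>2016-08-21</day><day>2016-10-21</day></interval>"


def c14n_digest(xml_text):
    """Canonicalize the document, then hash the canonical bytes."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Indentation survives canonicalization, so the two digests differ.
print(c14n_digest(indented) == c14n_digest(compact))  # False
```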

  • Why is that whitespace part of a canonical representation of an XML involved in digital signatures?
  • Is there any way of enforcing in the XSD the removal of such useless whitespace?

I was thinking more specifically of the whiteSpace facet. With collapse specified, the whitespace should be removed on validation; however, it seems that whiteSpace cannot be applied to a complexType, and I could not find a way of combining it with a sequence.

  • Can I apply the whiteSpace facet to a complexType (element only) node?
Pietro Saccardi

2 Answers


Why is that whitespace part of a canonical representation of an XML involved in digital signatures?

It's difficult to answer "why" questions, even for a member of the working group that published the spec (which I wasn't). I don't know why the spec authors made that decision, but I imagine that a decision either way would have suited some users at the expense of others.

Is there any way of enforcing in the XSD the removal of such useless whitespace?

Whitespace between elements in element-only content models is not considered significant in the PSVI (the post-schema-validation infoset). If you want to physically remove it, a practical way to do this is by copying the validated document with a schema-aware XSLT or XQuery processor, for example:

java net.sf.saxon.Query -s:input.xml -xsd:input.xsd -val:strict -qs:.

(The query "." here returns the input document after validation).
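Without a schema-aware processor, a similar effect can be approximated by hand. This is a minimal sketch using Python's standard-library ElementTree, assuming you list the element-only elements yourself. The `ELEMENT_ONLY` set and the `strip_ignorable_whitespace` helper are hypothetical and read nothing from the XSD.

```python
import xml.etree.ElementTree as ET

# Hypothetical: element names copied by hand from the XSD's
# element-only content models. Nothing here parses the schema.
ELEMENT_ONLY = {"interval"}


def strip_ignorable_whitespace(root, element_only=ELEMENT_ONLY):
    """Remove whitespace-only text directly inside element-only elements."""
    for elem in root.iter():
        if elem.tag not in element_only:
            continue
        # Whitespace between the start tag and the first child.
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        # Whitespace following each child, up to the next sibling or end tag.
        for child in elem:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    return root


root = ET.fromstring(
    "<interval>\n    <day>2016-08-21</day>\n    <day>2016-10-21</day>\n</interval>"
)
print(ET.tostring(strip_ignorable_whitespace(root), encoding="unicode"))
```

Elements that are not listed, such as one with simple content, keep their text untouched, so the result depends entirely on the list being faithful to the schema.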

Can I apply the whiteSpace facet to a complexType (element only) node?

No, and you don't need to.

Michael Kay

The following information was supplied by Pietro Saccardi in an edit to my answer, which I have separated out so that I do not appear to be the author.

In Python with lxml, the parser has a remove_blank_text option that strips this whitespace when parsing:

from lxml import etree
# Ask the parser to drop blank (whitespace-only) text nodes
parser = etree.XMLParser(remove_blank_text=True)
xml = etree.parse('file.xml', parser=parser)

MHK observation (from the documentation):

Note that the remove_blank_text option also uses a heuristic if it has no definite knowledge about the document's ignorable whitespace. It will keep blank text nodes that appear after non-blank text nodes at the same level. This is to prevent document-style XML from losing content.

This implies that remove_blank_text does not look at a schema or DTD to identify element-only content; it guesses from the instance data. The implication is that it might remove whitespace from an element like

<padding>    </padding>

that has simple content rather than element-only content.
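The risk applies to any stripper that does not consult a schema. The sketch below does not use lxml, whose heuristic may well behave differently on this input. It uses the `strip_text` option of the standard library's `xml.etree.ElementTree.canonicalize` (Python 3.8+, an assumption here) purely as a stand-in for schema-unaware stripping.

```python
from xml.etree.ElementTree import canonicalize

# An element whose simple content happens to be whitespace only.
doc = "<doc><padding>    </padding></doc>"

kept = canonicalize(doc)                       # default: text preserved
stripped = canonicalize(doc, strip_text=True)  # strips every text node

print(kept)      # <doc><padding>    </padding></doc>
print(stripped)  # <doc><padding></padding></doc>
```

The second result has lost the four spaces even though they were the element's actual value, which is the kind of loss only schema or DTD knowledge can rule out.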

Michael Kay